Commit c4cde580 authored by David S. Miller

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-03

The following pull-request contains BPF updates for your *net-next* tree.

There is a minor merge conflict in mlx5 due to 8960b389 ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but the resolution is not that bad ... getting the
current bpf-next out now before more mlx5 changes come in. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:

** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
  static int mlx5e_open_cq(struct mlx5e_channel *c,
                           struct dim_cq_moder moder,
                           struct mlx5e_cq_param *param,
                           struct mlx5e_cq *cq)
  =======
  int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
                    struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
  >>>>>>> e5a3e259

The resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature of mlx5e_open_cq() in ...

  drivers/net/ethernet/mellanox/mlx5/core/en.h +977

... and in mlx5e_open_xsk() ...

  drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64

... need the same rename from net_dim_cq_moder into dim_cq_moder.
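
Applying that rename to the second chunk, the resolved signature of
mlx5e_open_cq() would then read:

  int mlx5e_open_cq(struct mlx5e_channel *c, struct dim_cq_moder moder,
                    struct mlx5e_cq_param *param, struct mlx5e_cq *cq)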

** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
          int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
          struct dim_cq_moder icocq_moder = {0, 0};
          struct net_device *netdev = priv->netdev;
          struct mlx5e_channel *c;
          unsigned int irq;
  =======
          struct net_dim_cq_moder icocq_moder = {0, 0};
  >>>>>>> e5a3e259

Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.
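
With the rename applied, the resolved declaration would simply be:

  struct dim_cq_moder icocq_moder = {0, 0};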

Let me know if you run into any issues. Anyway, the main changes are:

1) Long-awaited AF_XDP support for the mlx5e driver, from Maxim.

2) Addition of two new per-cgroup BPF hooks for getsockopt and
   setsockopt along with a new sockopt program type which allows more
   fine-grained pass/reject settings for containers. Also add a sock_ops
   callback that can be selectively enabled on a per-socket basis and is
   executed for every RTT to help track TCP statistics, both features
   from Stanislav.

3) Follow-up fix for precision tracking with loops: precision marks were not
   being propagated, so the verifier assumed that some branches were never
   taken and therefore wrongly removed them as dead code, from Alexei.

4) Fix BPF cgroup release synchronization race which could lead to a
   double-free if a leaf's cgroup_bpf object is released and a new BPF
   program is attached to one of the ancestor cgroups in parallel, from Roman.

5) Support for bulking XDP_TX on veth devices which improves performance
   in some cases by around 9%, from Toshiaki.

6) Allow for lookups into BPF devmap and improve feedback when calling into
   bpf_redirect_map() as lookup is now performed right away in the helper
   itself, from Toke.

7) Add support for fq's Earliest Departure Time to the Host Bandwidth
   Manager (HBM) sample BPF program, from Lawrence.

8) Various cleanups and minor fixes all over the place from many others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parents e2c74694 e5a3e259

@@ -42,6 +42,7 @@ Program types
 .. toctree::
    :maxdepth: 1

+   prog_cgroup_sockopt
    prog_cgroup_sysctl
    prog_flow_dissector
......

.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_CGROUP_SOCKOPT
============================

The ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
cgroup hooks:

* ``BPF_CGROUP_GETSOCKOPT`` - called every time a process executes the
  ``getsockopt`` system call.
* ``BPF_CGROUP_SETSOCKOPT`` - called every time a process executes the
  ``setsockopt`` system call.

The context (``struct bpf_sockopt``) has the associated socket (``sk``) and
all the input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.

BPF_CGROUP_SETSOCKOPT
=====================

``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handling of
sockopt and has a writable context: it can modify the supplied arguments
before passing them down to the kernel. This hook has access to the cgroup
and socket local storage.

If the BPF program sets ``optlen`` to -1, control is returned to the
userspace after all other BPF programs in the cgroup chain finish
(i.e. kernel ``setsockopt`` handling will *not* be executed).

Note that ``optlen`` cannot be increased beyond the user-supplied
value. It can only be decreased or set to -1. Any other value will
trigger ``EFAULT``.

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
* ``1`` - success, continue with the next BPF program in the cgroup chain.
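
As an illustration, a minimal ``BPF_CGROUP_SETSOCKOPT`` program that consumes
a custom option might look like this (only a sketch, not the kernel selftest;
``SOL_CUSTOM``, the local ``SEC()`` definition and the function name are made
up for illustration)::

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>

    /* Normally provided by libbpf's helper header. */
    #define SEC(name) __attribute__((section(name), used))

    /* Hypothetical application-private sockopt level. */
    #define SOL_CUSTOM 0xdeadbeef

    SEC("cgroup/setsockopt")
    int handle_setsockopt(struct bpf_sockopt *ctx)
    {
        /* Consume the custom option: the kernel never sees it. */
        if (ctx->level == SOL_CUSTOM) {
            ctx->optlen = -1;   /* skip kernel setsockopt handling */
            return 1;           /* continue with the next program in the chain */
        }

        /* Anything else is passed to the kernel unmodified. */
        return 1;
    }

    char _license[] SEC("license") = "GPL";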

BPF_CGROUP_GETSOCKOPT
=====================

``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handling of
sockopt. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
if it's interested in whatever the kernel has returned. The BPF hook can
override the values above, adjust ``optlen`` and reset ``retval`` to 0.
If ``optlen`` has been increased above the initial ``getsockopt`` value
(i.e. the userspace buffer is too small), ``EFAULT`` is returned.

This hook has access to the cgroup and socket local storage.

Note that the only acceptable values to set ``retval`` to are 0 and the
original value that the kernel returned. Any other value will trigger
``EFAULT``.

Return Type
-----------

* ``0`` - reject the syscall, ``EPERM`` will be returned to the userspace.
* ``1`` - success: copy ``optval`` and ``optlen`` to userspace, return
  ``retval`` from the syscall (note that this can be overwritten by
  the BPF program from the parent cgroup).
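
Similarly, a minimal ``BPF_CGROUP_GETSOCKOPT`` sketch that overrides one
option value could look like this (again not the kernel selftest; the option
and the names are chosen arbitrarily)::

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>

    /* Normally provided by libbpf's helper header. */
    #define SEC(name) __attribute__((section(name), used))

    /* Values from <linux/in.h>, repeated to keep the sketch self-contained. */
    #define SOL_IP 0
    #define IP_TOS 1

    SEC("cgroup/getsockopt")
    int handle_getsockopt(struct bpf_sockopt *ctx)
    {
        __u8 *optval = ctx->optval;
        __u8 *optval_end = ctx->optval_end;

        if (ctx->level != SOL_IP || ctx->optname != IP_TOS)
            return 1;           /* not ours, pass the kernel's result through */

        if (optval + 1 > optval_end)
            return 0;           /* bounds check failed, reject with EPERM */

        /* Override the TOS byte the kernel returned with a fixed value. */
        optval[0] = 0x10;
        ctx->optlen = 1;
        ctx->retval = 0;        /* only 0 or the original retval is allowed */

        return 1;
    }

    char _license[] SEC("license") = "GPL";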

Cgroup Inheritance
==================

Suppose there is the following cgroup hierarchy where each cgroup
has ``BPF_CGROUP_GETSOCKOPT`` attached at each level with the
``BPF_F_ALLOW_MULTI`` flag::

   A (root, parent)
    \
     B (child)

When the application calls the ``getsockopt`` syscall from cgroup B,
the programs are executed from the bottom up: B, then A. The first
program (B) sees the result of the kernel's ``getsockopt``. It can
optionally adjust ``optval``, ``optlen`` and reset ``retval`` to 0.
After that, control is passed to the second program (A), which sees
the same context as B, including any potential modifications.

The same holds for ``BPF_CGROUP_SETSOCKOPT``: if programs are attached to
both A and B, the trigger order is B, then A. If B makes any changes
to the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
then the next program in the chain (A) will see those changes,
*not* the original input ``setsockopt`` arguments. The potentially
modified values are then passed down to the kernel.
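
From userspace, attaching at both levels could look roughly like the sketch
below, using libbpf's ``bpf_prog_attach()``; the cgroup paths and ``prog_fd``
are placeholders::

    #include <fcntl.h>
    #include <bpf/bpf.h>        /* bpf_prog_attach() */

    static int attach_both(int prog_fd)
    {
        int a_fd = open("/sys/fs/cgroup/A", O_RDONLY);
        int b_fd = open("/sys/fs/cgroup/A/B", O_RDONLY);
        int err;

        if (a_fd < 0 || b_fd < 0)
            return -1;

        /* BPF_F_ALLOW_MULTI keeps the per-cgroup chains described above. */
        err = bpf_prog_attach(prog_fd, a_fd, BPF_CGROUP_GETSOCKOPT,
                              BPF_F_ALLOW_MULTI);
        if (!err)
            err = bpf_prog_attach(prog_fd, b_fd, BPF_CGROUP_GETSOCKOPT,
                                  BPF_F_ALLOW_MULTI);
        return err;
    }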

Example
=======

See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
of a BPF program that handles socket options.

@@ -220,7 +220,21 @@ Usage
 In order to use AF_XDP sockets there are two parts needed. The
 user-space application and the XDP program. For a complete setup and
 usage example, please refer to the sample application. The user-space
-side is xdpsock_user.c and the XDP side xdpsock_kern.c.
+side is xdpsock_user.c and the XDP side is part of libbpf.
+
+The XDP code sample included in tools/lib/bpf/xsk.c is the following::
+
+   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+   {
+       int index = ctx->rx_queue_index;
+
+       // A set entry here means that the corresponding queue_id
+       // has an active AF_XDP socket bound to it.
+       if (bpf_map_lookup_elem(&xsks_map, &index))
+           return bpf_redirect_map(&xsks_map, index, 0);
+
+       return XDP_PASS;
+   }
 
 Naive ring dequeue and enqueue could look like this::
......
...@@ -641,8 +641,8 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget) ...@@ -641,8 +641,8 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
struct i40e_tx_desc *tx_desc = NULL; struct i40e_tx_desc *tx_desc = NULL;
struct i40e_tx_buffer *tx_bi; struct i40e_tx_buffer *tx_bi;
bool work_done = true; bool work_done = true;
struct xdp_desc desc;
dma_addr_t dma; dma_addr_t dma;
u32 len;
while (budget-- > 0) { while (budget-- > 0) {
if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) { if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
...@@ -651,21 +651,23 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget) ...@@ -651,21 +651,23 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
break; break;
} }
if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len)) if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
break; break;
dma_sync_single_for_device(xdp_ring->dev, dma, len, dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
DMA_BIDIRECTIONAL); DMA_BIDIRECTIONAL);
tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use]; tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
tx_bi->bytecount = len; tx_bi->bytecount = desc.len;
tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use); tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
tx_desc->buffer_addr = cpu_to_le64(dma); tx_desc->buffer_addr = cpu_to_le64(dma);
tx_desc->cmd_type_offset_bsz = tx_desc->cmd_type_offset_bsz =
build_ctob(I40E_TX_DESC_CMD_ICRC build_ctob(I40E_TX_DESC_CMD_ICRC
| I40E_TX_DESC_CMD_EOP, | I40E_TX_DESC_CMD_EOP,
0, len, 0); 0, desc.len, 0);
xdp_ring->next_to_use++; xdp_ring->next_to_use++;
if (xdp_ring->next_to_use == xdp_ring->count) if (xdp_ring->next_to_use == xdp_ring->count)
......
...@@ -571,8 +571,9 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget) ...@@ -571,8 +571,9 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
union ixgbe_adv_tx_desc *tx_desc = NULL; union ixgbe_adv_tx_desc *tx_desc = NULL;
struct ixgbe_tx_buffer *tx_bi; struct ixgbe_tx_buffer *tx_bi;
bool work_done = true; bool work_done = true;
u32 len, cmd_type; struct xdp_desc desc;
dma_addr_t dma; dma_addr_t dma;
u32 cmd_type;
while (budget-- > 0) { while (budget-- > 0) {
if (unlikely(!ixgbe_desc_unused(xdp_ring)) || if (unlikely(!ixgbe_desc_unused(xdp_ring)) ||
...@@ -581,14 +582,16 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget) ...@@ -581,14 +582,16 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
break; break;
} }
if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len)) if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
break; break;
dma_sync_single_for_device(xdp_ring->dev, dma, len, dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
DMA_BIDIRECTIONAL); DMA_BIDIRECTIONAL);
tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use]; tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use];
tx_bi->bytecount = len; tx_bi->bytecount = desc.len;
tx_bi->xdpf = NULL; tx_bi->xdpf = NULL;
tx_bi->gso_segs = 1; tx_bi->gso_segs = 1;
...@@ -599,10 +602,10 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget) ...@@ -599,10 +602,10 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
cmd_type = IXGBE_ADVTXD_DTYP_DATA | cmd_type = IXGBE_ADVTXD_DTYP_DATA |
IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DCMD_DEXT |
IXGBE_ADVTXD_DCMD_IFCS; IXGBE_ADVTXD_DCMD_IFCS;
cmd_type |= len | IXGBE_TXD_CMD; cmd_type |= desc.len | IXGBE_TXD_CMD;
tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type); tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
tx_desc->read.olinfo_status = tx_desc->read.olinfo_status =
cpu_to_le32(len << IXGBE_ADVTXD_PAYLEN_SHIFT); cpu_to_le32(desc.len << IXGBE_ADVTXD_PAYLEN_SHIFT);
xdp_ring->next_to_use++; xdp_ring->next_to_use++;
if (xdp_ring->next_to_use == xdp_ring->count) if (xdp_ring->next_to_use == xdp_ring->count)
......
@@ -24,7 +24,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
 mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
 		en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
 		en_selftest.o en/port.o en/monitor_stats.o en/reporter_tx.o \
-		en/params.o
+		en/params.o en/xsk/umem.o en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o

 #
 # Netdev extra
......
...@@ -3,65 +3,102 @@ ...@@ -3,65 +3,102 @@
#include "en/params.h" #include "en/params.h"
u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params) static inline bool mlx5e_rx_is_xdp(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{ {
u16 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu); return params->xdp_prog || xsk;
u16 linear_rq_headroom = params->xdp_prog ? }
XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
u32 frag_sz; u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{
u16 headroom = NET_IP_ALIGN;
if (mlx5e_rx_is_xdp(params, xsk)) {
headroom += XDP_PACKET_HEADROOM;
if (xsk)
headroom += xsk->headroom;
} else {
headroom += MLX5_RX_HEADROOM;
}
return headroom;
}
u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{
u32 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
u16 linear_rq_headroom = mlx5e_get_linear_rq_headroom(params, xsk);
u32 frag_sz = linear_rq_headroom + hw_mtu;
linear_rq_headroom += NET_IP_ALIGN; /* AF_XDP doesn't build SKBs in place. */
if (!xsk)
frag_sz = MLX5_SKB_FRAG_SZ(frag_sz);
frag_sz = MLX5_SKB_FRAG_SZ(linear_rq_headroom + hw_mtu); /* XDP in mlx5e doesn't support multiple packets per page. */
if (mlx5e_rx_is_xdp(params, xsk))
frag_sz = max_t(u32, frag_sz, PAGE_SIZE);
if (params->xdp_prog && frag_sz < PAGE_SIZE) /* Even if we can go with a smaller fragment size, we must not put
frag_sz = PAGE_SIZE; * multiple packets into a single frame.
*/
if (xsk)
frag_sz = max_t(u32, frag_sz, xsk->chunk_size);
return frag_sz; return frag_sz;
} }
u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params) u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{ {
u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params); u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params, xsk);
return MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(linear_frag_sz); return MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(linear_frag_sz);
} }
bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params) bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{ {
u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params); /* AF_XDP allocates SKBs on XDP_PASS - ensure they don't occupy more
* than one page. For this, check both with and without xsk.
*/
u32 linear_frag_sz = max(mlx5e_rx_get_linear_frag_sz(params, xsk),
mlx5e_rx_get_linear_frag_sz(params, NULL));
return !params->lro_en && frag_sz <= PAGE_SIZE; return !params->lro_en && linear_frag_sz <= PAGE_SIZE;
} }
#define MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ ((BIT(__mlx5_bit_sz(wq, log_wqe_stride_size)) - 1) + \ #define MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ ((BIT(__mlx5_bit_sz(wq, log_wqe_stride_size)) - 1) + \
MLX5_MPWQE_LOG_STRIDE_SZ_BASE) MLX5_MPWQE_LOG_STRIDE_SZ_BASE)
bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev, bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev,
struct mlx5e_params *params) struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{ {
u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params); u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params, xsk);
s8 signed_log_num_strides_param; s8 signed_log_num_strides_param;
u8 log_num_strides; u8 log_num_strides;
if (!mlx5e_rx_is_linear_skb(params)) if (!mlx5e_rx_is_linear_skb(params, xsk))
return false; return false;
if (order_base_2(frag_sz) > MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ) if (order_base_2(linear_frag_sz) > MLX5_MAX_MPWQE_LOG_WQE_STRIDE_SZ)
return false; return false;
if (MLX5_CAP_GEN(mdev, ext_stride_num_range)) if (MLX5_CAP_GEN(mdev, ext_stride_num_range))
return true; return true;
log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(frag_sz); log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(linear_frag_sz);
signed_log_num_strides_param = signed_log_num_strides_param =
(s8)log_num_strides - MLX5_MPWQE_LOG_NUM_STRIDES_BASE; (s8)log_num_strides - MLX5_MPWQE_LOG_NUM_STRIDES_BASE;
return signed_log_num_strides_param >= 0; return signed_log_num_strides_param >= 0;
} }
u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params) u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{ {
u8 log_pkts_per_wqe = mlx5e_mpwqe_log_pkts_per_wqe(params); u8 log_pkts_per_wqe = mlx5e_mpwqe_log_pkts_per_wqe(params, xsk);
/* Numbers are unsigned, don't subtract to avoid underflow. */ /* Numbers are unsigned, don't subtract to avoid underflow. */
if (params->log_rq_mtu_frames < if (params->log_rq_mtu_frames <
...@@ -72,33 +109,30 @@ u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params) ...@@ -72,33 +109,30 @@ u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params)
} }
u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev, u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
struct mlx5e_params *params) struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{ {
if (mlx5e_rx_mpwqe_is_linear_skb(mdev, params)) if (mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk))
return order_base_2(mlx5e_rx_get_linear_frag_sz(params)); return order_base_2(mlx5e_rx_get_linear_frag_sz(params, xsk));
return MLX5_MPWRQ_DEF_LOG_STRIDE_SZ(mdev); return MLX5_MPWRQ_DEF_LOG_STRIDE_SZ(mdev);
} }
u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev, u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
struct mlx5e_params *params) struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{ {
return MLX5_MPWRQ_LOG_WQE_SZ - return MLX5_MPWRQ_LOG_WQE_SZ -
mlx5e_mpwqe_get_log_stride_size(mdev, params); mlx5e_mpwqe_get_log_stride_size(mdev, params, xsk);
} }
u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev, u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
struct mlx5e_params *params) struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk)
{ {
u16 linear_rq_headroom = params->xdp_prog ? bool is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM; mlx5e_rx_is_linear_skb(params, xsk) :
bool is_linear_skb; mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk);
linear_rq_headroom += NET_IP_ALIGN;
is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
mlx5e_rx_is_linear_skb(params) :
mlx5e_rx_mpwqe_is_linear_skb(mdev, params);
return is_linear_skb ? linear_rq_headroom : 0; return is_linear_skb ? mlx5e_get_linear_rq_headroom(params, xsk) : 0;
} }
...@@ -6,17 +6,119 @@ ...@@ -6,17 +6,119 @@
#include "en.h" #include "en.h"
u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params); struct mlx5e_xsk_param {
u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params); u16 headroom;
bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params); u16 chunk_size;
};
struct mlx5e_rq_param {
u32 rqc[MLX5_ST_SZ_DW(rqc)];
struct mlx5_wq_param wq;
struct mlx5e_rq_frags_info frags_info;
};
struct mlx5e_sq_param {
u32 sqc[MLX5_ST_SZ_DW(sqc)];
struct mlx5_wq_param wq;
bool is_mpw;
};
struct mlx5e_cq_param {
u32 cqc[MLX5_ST_SZ_DW(cqc)];
struct mlx5_wq_param wq;
u16 eq_ix;
u8 cq_period_mode;
};
struct mlx5e_channel_param {
struct mlx5e_rq_param rq;
struct mlx5e_sq_param sq;
struct mlx5e_sq_param xdp_sq;
struct mlx5e_sq_param icosq;
struct mlx5e_cq_param rx_cq;
struct mlx5e_cq_param tx_cq;
struct mlx5e_cq_param icosq_cq;
};
static inline bool mlx5e_qid_get_ch_if_in_group(struct mlx5e_params *params,
u16 qid,
enum mlx5e_rq_group group,
u16 *ix)
{
int nch = params->num_channels;
int ch = qid - nch * group;
if (ch < 0 || ch >= nch)
return false;
*ix = ch;
return true;
}
static inline void mlx5e_qid_get_ch_and_group(struct mlx5e_params *params,
u16 qid,
u16 *ix,
enum mlx5e_rq_group *group)
{
u16 nch = params->num_channels;
*ix = qid % nch;
*group = qid / nch;
}
static inline bool mlx5e_qid_validate(struct mlx5e_params *params, u64 qid)
{
return qid < params->num_channels * MLX5E_NUM_RQ_GROUPS;
}
/* Parameter calculations */
u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev, bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev,
struct mlx5e_params *params); struct mlx5e_params *params,
u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params); struct mlx5e_xsk_param *xsk);
u8 mlx5e_mpwqe_get_log_rq_size(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev, u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
struct mlx5e_params *params); struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev, u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
struct mlx5e_params *params); struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev, u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
struct mlx5e_params *params); struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk);
/* Build queue parameters */
void mlx5e_build_rq_param(struct mlx5e_priv *priv,
struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk,
struct mlx5e_rq_param *param);
void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
struct mlx5e_sq_param *param);
void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk,
struct mlx5e_cq_param *param);
void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
struct mlx5e_params *params,
struct mlx5e_cq_param *param);
void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
u8 log_wq_size,
struct mlx5e_cq_param *param);
void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
u8 log_wq_size,
struct mlx5e_sq_param *param);
void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
struct mlx5e_params *params,
struct mlx5e_sq_param *param);
#endif /* __MLX5_EN_PARAMS_H__ */ #endif /* __MLX5_EN_PARAMS_H__ */
...@@ -39,11 +39,13 @@ ...@@ -39,11 +39,13 @@
(sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS) (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
#define MLX5E_XDP_TX_DS_COUNT (MLX5E_XDP_TX_EMPTY_DS_COUNT + 1 /* SG DS */) #define MLX5E_XDP_TX_DS_COUNT (MLX5E_XDP_TX_EMPTY_DS_COUNT + 1 /* SG DS */)
int mlx5e_xdp_max_mtu(struct mlx5e_params *params); struct mlx5e_xsk_param;
int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di, bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
void *va, u16 *rx_headroom, u32 *len); void *va, u16 *rx_headroom, u32 *len, bool xsk);
bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq); void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq);
void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq); bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
void mlx5e_set_xmit_fp(struct mlx5e_xdpsq *sq, bool is_mpw); void mlx5e_set_xmit_fp(struct mlx5e_xdpsq *sq, bool is_mpw);
void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq); void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames, int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
...@@ -66,6 +68,21 @@ static inline bool mlx5e_xdp_tx_is_enabled(struct mlx5e_priv *priv) ...@@ -66,6 +68,21 @@ static inline bool mlx5e_xdp_tx_is_enabled(struct mlx5e_priv *priv)
return test_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state); return test_bit(MLX5E_STATE_XDP_TX_ENABLED, &priv->state);
} }
static inline void mlx5e_xdp_set_open(struct mlx5e_priv *priv)
{
set_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
}
static inline void mlx5e_xdp_set_closed(struct mlx5e_priv *priv)
{
clear_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
}
static inline bool mlx5e_xdp_is_open(struct mlx5e_priv *priv)
{
return test_bit(MLX5E_STATE_XDP_OPEN, &priv->state);
}
static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq) static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
{ {
if (sq->doorbell_cseg) { if (sq->doorbell_cseg) {
...@@ -97,15 +114,14 @@ static inline void mlx5e_xdp_update_inline_state(struct mlx5e_xdpsq *sq) ...@@ -97,15 +114,14 @@ static inline void mlx5e_xdp_update_inline_state(struct mlx5e_xdpsq *sq)
} }
static inline void static inline void
mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi, mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq,
struct mlx5e_xdp_xmit_data *xdptxd,
struct mlx5e_xdpsq_stats *stats) struct mlx5e_xdpsq_stats *stats)
{ {
struct mlx5e_xdp_mpwqe *session = &sq->mpwqe; struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
dma_addr_t dma_addr = xdpi->dma_addr;
struct xdp_frame *xdpf = xdpi->xdpf;
struct mlx5_wqe_data_seg *dseg = struct mlx5_wqe_data_seg *dseg =
(struct mlx5_wqe_data_seg *)session->wqe + session->ds_count; (struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
u16 dma_len = xdpf->len; u32 dma_len = xdptxd->len;
session->pkt_count++; session->pkt_count++;
...@@ -124,7 +140,7 @@ mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi, ...@@ -124,7 +140,7 @@ mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi,
} }
inline_dseg->byte_count = cpu_to_be32(dma_len | MLX5_INLINE_SEG); inline_dseg->byte_count = cpu_to_be32(dma_len | MLX5_INLINE_SEG);
memcpy(inline_dseg->data, xdpf->data, dma_len); memcpy(inline_dseg->data, xdptxd->data, dma_len);
session->ds_count += ds_cnt; session->ds_count += ds_cnt;
stats->inlnw++; stats->inlnw++;
...@@ -132,7 +148,7 @@ mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi, ...@@ -132,7 +148,7 @@ mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi,
} }
no_inline: no_inline:
dseg->addr = cpu_to_be64(dma_addr); dseg->addr = cpu_to_be64(xdptxd->dma_addr);
dseg->byte_count = cpu_to_be32(dma_len); dseg->byte_count = cpu_to_be32(dma_len);
dseg->lkey = sq->mkey_be; dseg->lkey = sq->mkey_be;
session->ds_count++; session->ds_count++;
......
// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/* Copyright (c) 2019 Mellanox Technologies. */
#include "rx.h"
#include "en/xdp.h"
#include <net/xdp_sock.h>
/* RX data path */
bool mlx5e_xsk_pages_enough_umem(struct mlx5e_rq *rq, int count)
{
/* Check in advance that we have enough frames, instead of allocating
* one-by-one, failing and moving frames to the Reuse Ring.
*/
return xsk_umem_has_addrs_rq(rq->umem, count);
}
int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info)
{
struct xdp_umem *umem = rq->umem;
u64 handle;
if (!xsk_umem_peek_addr_rq(umem, &handle))
return -ENOMEM;
dma_info->xsk.handle = handle + rq->buff.umem_headroom;
dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);
/* No need to add headroom to the DMA address. In striding RQ case, we
* just provide pages for UMR, and headroom is counted at the setup
* stage when creating a WQE. In non-striding RQ case, headroom is
* accounted in mlx5e_alloc_rx_wqe.
*/
dma_info->addr = xdp_umem_get_dma(umem, handle);
xsk_umem_discard_addr_rq(umem);
dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
DMA_BIDIRECTIONAL);
return 0;
}
static inline void mlx5e_xsk_recycle_frame(struct mlx5e_rq *rq, u64 handle)
{
xsk_umem_fq_reuse(rq->umem, handle & rq->umem->chunk_mask);
}
/* XSKRQ uses pages from UMEM, they must not be released. They are returned to
* the userspace if possible, and if not, this function is called to reuse them
* in the driver.
*/
void mlx5e_xsk_page_release(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info)
{
mlx5e_xsk_recycle_frame(rq, dma_info->xsk.handle);
}
/* Return a frame back to the hardware to fill in again. It is used by XDP when
* the XDP program returns XDP_TX or XDP_REDIRECT not to an XSKMAP.
*/
void mlx5e_xsk_zca_free(struct zero_copy_allocator *zca, unsigned long handle)
{
struct mlx5e_rq *rq = container_of(zca, struct mlx5e_rq, zca);
mlx5e_xsk_recycle_frame(rq, handle);
}
static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, void *data,
u32 cqe_bcnt)
{
struct sk_buff *skb;
skb = napi_alloc_skb(rq->cq.napi, cqe_bcnt);
if (unlikely(!skb)) {
rq->stats->buff_alloc_err++;
return NULL;
}
skb_put_data(skb, data, cqe_bcnt);
return skb;
}
struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
struct mlx5e_mpw_info *wi,
u16 cqe_bcnt,
u32 head_offset,
u32 page_idx)
{
struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
u16 rx_headroom = rq->buff.headroom - rq->buff.umem_headroom;
u32 cqe_bcnt32 = cqe_bcnt;
void *va, *data;
u32 frag_size;
bool consumed;
/* Check packet size. Note LRO doesn't use linear SKB */
if (unlikely(cqe_bcnt > rq->hw_mtu)) {
rq->stats->oversize_pkts_sw_drop++;
return NULL;
}
/* head_offset is not used in this function, because di->xsk.data and
* di->addr point directly to the necessary place. Furthermore, in the
* current implementation, one page = one packet = one frame, so
* head_offset should always be 0.
*/
WARN_ON_ONCE(head_offset);
va = di->xsk.data;
data = va + rx_headroom;
frag_size = rq->buff.headroom + cqe_bcnt32;
dma_sync_single_for_cpu(rq->pdev, di->addr, frag_size, DMA_BIDIRECTIONAL);
prefetch(data);
rcu_read_lock();
consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, true);
rcu_read_unlock();
/* Possible flows:
* - XDP_REDIRECT to XSKMAP:
* The page is owned by the userspace from now.
* - XDP_TX and other XDP_REDIRECTs:
* The page was returned by ZCA and recycled.
* - XDP_DROP:
* Recycle the page.
* - XDP_PASS:
* Allocate an SKB, copy the data and recycle the page.
*
* Pages to be recycled go to the Reuse Ring on MPWQE deallocation. Its
* size is the same as the Driver RX Ring's size, and pages for WQEs are
* allocated first from the Reuse Ring, so it has enough space.
*/
if (likely(consumed)) {
if (likely(__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)))
__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
return NULL; /* page/packet was consumed by XDP */
}
/* XDP_PASS: copy the data from the UMEM to a new SKB and reuse the
* frame. On SKB allocation failure, NULL is returned.
*/
return mlx5e_xsk_construct_skb(rq, data, cqe_bcnt32);
}
struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
struct mlx5_cqe64 *cqe,
struct mlx5e_wqe_frag_info *wi,
u32 cqe_bcnt)
{
struct mlx5e_dma_info *di = wi->di;
u16 rx_headroom = rq->buff.headroom - rq->buff.umem_headroom;
void *va, *data;
bool consumed;
u32 frag_size;
/* wi->offset is not used in this function, because di->xsk.data and
* di->addr point directly to the necessary place. Furthermore, in the
* current implementation, one page = one packet = one frame, so
* wi->offset should always be 0.
*/
WARN_ON_ONCE(wi->offset);
va = di->xsk.data;
data = va + rx_headroom;
frag_size = rq->buff.headroom + cqe_bcnt;
dma_sync_single_for_cpu(rq->pdev, di->addr, frag_size, DMA_BIDIRECTIONAL);
prefetch(data);
if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_RESP_SEND)) {
rq->stats->wqe_err++;
return NULL;
}
rcu_read_lock();
consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, true);
rcu_read_unlock();
if (likely(consumed))
return NULL; /* page/packet was consumed by XDP */
/* XDP_PASS: copy the data from the UMEM to a new SKB. The frame reuse
* will be handled by mlx5e_put_rx_frag.
* On SKB allocation failure, NULL is returned.
*/
return mlx5e_xsk_construct_skb(rq, data, cqe_bcnt);
}
/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
/* Copyright (c) 2019 Mellanox Technologies. */
#ifndef __MLX5_EN_XSK_RX_H__
#define __MLX5_EN_XSK_RX_H__
#include "en.h"
/* RX data path */
bool mlx5e_xsk_pages_enough_umem(struct mlx5e_rq *rq, int count);
int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info);
void mlx5e_xsk_page_release(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info);
void mlx5e_xsk_zca_free(struct zero_copy_allocator *zca, unsigned long handle);
struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
struct mlx5e_mpw_info *wi,
u16 cqe_bcnt,
u32 head_offset,
u32 page_idx);
struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
struct mlx5_cqe64 *cqe,
struct mlx5e_wqe_frag_info *wi,
u32 cqe_bcnt);
#endif /* __MLX5_EN_XSK_RX_H__ */
// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/* Copyright (c) 2019 Mellanox Technologies. */
#include "setup.h"
#include "en/params.h"
bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk,
struct mlx5_core_dev *mdev)
{
/* AF_XDP doesn't support frames larger than PAGE_SIZE, and the current
* mlx5e XDP implementation doesn't support multiple packets per page.
*/
if (xsk->chunk_size != PAGE_SIZE)
return false;
/* Current MTU and XSK headroom don't allow packets to fit the frames. */
if (mlx5e_rx_get_linear_frag_sz(params, xsk) > xsk->chunk_size)
return false;
/* frag_sz is different for regular and XSK RQs, so ensure that linear
* SKB mode is possible.
*/
switch (params->rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
return mlx5e_rx_mpwqe_is_linear_skb(mdev, params, xsk);
default: /* MLX5_WQ_TYPE_CYCLIC */
return mlx5e_rx_is_linear_skb(params, xsk);
}
}
static void mlx5e_build_xskicosq_param(struct mlx5e_priv *priv,
u8 log_wq_size,
struct mlx5e_sq_param *param)
{
void *sqc = param->sqc;
void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
mlx5e_build_sq_param_common(priv, param);
MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
}
static void mlx5e_build_xsk_cparam(struct mlx5e_priv *priv,
struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk,
struct mlx5e_channel_param *cparam)
{
const u8 xskicosq_size = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
mlx5e_build_rq_param(priv, params, xsk, &cparam->rq);
mlx5e_build_xdpsq_param(priv, params, &cparam->xdp_sq);
mlx5e_build_xskicosq_param(priv, xskicosq_size, &cparam->icosq);
mlx5e_build_rx_cq_param(priv, params, xsk, &cparam->rx_cq);
mlx5e_build_tx_cq_param(priv, params, &cparam->tx_cq);
mlx5e_build_ico_cq_param(priv, xskicosq_size, &cparam->icosq_cq);
}
int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk, struct xdp_umem *umem,
struct mlx5e_channel *c)
{
struct mlx5e_channel_param cparam = {};
struct dim_cq_moder icocq_moder = {};
int err;
if (!mlx5e_validate_xsk_param(params, xsk, priv->mdev))
return -EINVAL;
mlx5e_build_xsk_cparam(priv, params, xsk, &cparam);
err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam.rx_cq, &c->xskrq.cq);
if (unlikely(err))
return err;
err = mlx5e_open_rq(c, params, &cparam.rq, xsk, umem, &c->xskrq);
if (unlikely(err))
goto err_close_rx_cq;
err = mlx5e_open_cq(c, params->tx_cq_moderation, &cparam.tx_cq, &c->xsksq.cq);
if (unlikely(err))
goto err_close_rq;
/* Create a separate SQ, so that when the UMEM is disabled, we could
* close this SQ safely and stop receiving CQEs. In other case, e.g., if
* the XDPSQ was used instead, we might run into trouble when the UMEM
* is disabled and then reenabled, but the SQ continues receiving CQEs
* from the old UMEM.
*/
err = mlx5e_open_xdpsq(c, params, &cparam.xdp_sq, umem, &c->xsksq, true);
if (unlikely(err))
goto err_close_tx_cq;
err = mlx5e_open_cq(c, icocq_moder, &cparam.icosq_cq, &c->xskicosq.cq);
if (unlikely(err))
goto err_close_sq;
/* Create a dedicated SQ for posting NOPs whenever we need an IRQ to be
* triggered and NAPI to be called on the correct CPU.
*/
err = mlx5e_open_icosq(c, params, &cparam.icosq, &c->xskicosq);
if (unlikely(err))
goto err_close_icocq;
spin_lock_init(&c->xskicosq_lock);
set_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
return 0;
err_close_icocq:
mlx5e_close_cq(&c->xskicosq.cq);
err_close_sq:
mlx5e_close_xdpsq(&c->xsksq);
err_close_tx_cq:
mlx5e_close_cq(&c->xsksq.cq);
err_close_rq:
mlx5e_close_rq(&c->xskrq);
err_close_rx_cq:
mlx5e_close_cq(&c->xskrq.cq);
return err;
}
void mlx5e_close_xsk(struct mlx5e_channel *c)
{
clear_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
napi_synchronize(&c->napi);
mlx5e_close_rq(&c->xskrq);
mlx5e_close_cq(&c->xskrq.cq);
mlx5e_close_icosq(&c->xskicosq);
mlx5e_close_cq(&c->xskicosq.cq);
mlx5e_close_xdpsq(&c->xsksq);
mlx5e_close_cq(&c->xsksq.cq);
}
void mlx5e_activate_xsk(struct mlx5e_channel *c)
{
set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
/* TX queue is created active. */
mlx5e_trigger_irq(&c->xskicosq);
}
void mlx5e_deactivate_xsk(struct mlx5e_channel *c)
{
mlx5e_deactivate_rq(&c->xskrq);
/* TX queue is disabled on close. */
}
static int mlx5e_redirect_xsk_rqt(struct mlx5e_priv *priv, u16 ix, u32 rqn)
{
struct mlx5e_redirect_rqt_param direct_rrp = {
.is_rss = false,
{
.rqn = rqn,
},
};
u32 rqtn = priv->xsk_tir[ix].rqt.rqtn;
return mlx5e_redirect_rqt(priv, rqtn, 1, direct_rrp);
}
int mlx5e_xsk_redirect_rqt_to_channel(struct mlx5e_priv *priv, struct mlx5e_channel *c)
{
return mlx5e_redirect_xsk_rqt(priv, c->ix, c->xskrq.rqn);
}
int mlx5e_xsk_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix)
{
return mlx5e_redirect_xsk_rqt(priv, ix, priv->drop_rq.rqn);
}
int mlx5e_xsk_redirect_rqts_to_channels(struct mlx5e_priv *priv, struct mlx5e_channels *chs)
{
int err, i;
if (!priv->xsk.refcnt)
return 0;
for (i = 0; i < chs->num; i++) {
struct mlx5e_channel *c = chs->c[i];
if (!test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
continue;
err = mlx5e_xsk_redirect_rqt_to_channel(priv, c);
if (unlikely(err))
goto err_stop;
}
return 0;
err_stop:
for (i--; i >= 0; i--) {
if (!test_bit(MLX5E_CHANNEL_STATE_XSK, chs->c[i]->state))
continue;
mlx5e_xsk_redirect_rqt_to_drop(priv, i);
}
return err;
}
void mlx5e_xsk_redirect_rqts_to_drop(struct mlx5e_priv *priv, struct mlx5e_channels *chs)
{
int i;
if (!priv->xsk.refcnt)
return;
for (i = 0; i < chs->num; i++) {
if (!test_bit(MLX5E_CHANNEL_STATE_XSK, chs->c[i]->state))
continue;
mlx5e_xsk_redirect_rqt_to_drop(priv, i);
}
}
/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
/* Copyright (c) 2019 Mellanox Technologies. */
#ifndef __MLX5_EN_XSK_SETUP_H__
#define __MLX5_EN_XSK_SETUP_H__
#include "en.h"
struct mlx5e_xsk_param;
bool mlx5e_validate_xsk_param(struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk,
struct mlx5_core_dev *mdev);
int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params,
struct mlx5e_xsk_param *xsk, struct xdp_umem *umem,
struct mlx5e_channel *c);
void mlx5e_close_xsk(struct mlx5e_channel *c);
void mlx5e_activate_xsk(struct mlx5e_channel *c);
void mlx5e_deactivate_xsk(struct mlx5e_channel *c);
int mlx5e_xsk_redirect_rqt_to_channel(struct mlx5e_priv *priv, struct mlx5e_channel *c);
int mlx5e_xsk_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix);
int mlx5e_xsk_redirect_rqts_to_channels(struct mlx5e_priv *priv, struct mlx5e_channels *chs);
void mlx5e_xsk_redirect_rqts_to_drop(struct mlx5e_priv *priv, struct mlx5e_channels *chs);
#endif /* __MLX5_EN_XSK_SETUP_H__ */
// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/* Copyright (c) 2019 Mellanox Technologies. */
#include "tx.h"
#include "umem.h"
#include "en/xdp.h"
#include "en/params.h"
#include <net/xdp_sock.h>
int mlx5e_xsk_async_xmit(struct net_device *dev, u32 qid)
{
struct mlx5e_priv *priv = netdev_priv(dev);
struct mlx5e_params *params = &priv->channels.params;
struct mlx5e_channel *c;
u16 ix;
if (unlikely(!mlx5e_xdp_is_open(priv)))
return -ENETDOWN;
if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid, MLX5E_RQ_GROUP_XSK, &ix)))
return -EINVAL;
c = priv->channels.c[ix];
if (unlikely(!test_bit(MLX5E_CHANNEL_STATE_XSK, c->state)))
return -ENXIO;
if (!napi_if_scheduled_mark_missed(&c->napi)) {
spin_lock(&c->xskicosq_lock);
mlx5e_trigger_irq(&c->xskicosq);
spin_unlock(&c->xskicosq_lock);
}
return 0;
}
/* When TX fails (because of the size of the packet), we need to get completions
* in order, so post a NOP to get a CQE. Since AF_XDP doesn't distinguish
* between successful TX and errors, handling in mlx5e_poll_xdpsq_cq is the
* same.
*/
static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
struct mlx5e_xdp_info *xdpi)
{
u16 pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
struct mlx5e_xdp_wqe_info *wi = &sq->db.wqe_info[pi];
struct mlx5e_tx_wqe *nopwqe;
wi->num_wqebbs = 1;
wi->num_pkts = 1;
nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
sq->doorbell_cseg = &nopwqe->ctrl;
}
bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
{
struct xdp_umem *umem = sq->umem;
struct mlx5e_xdp_info xdpi;
struct mlx5e_xdp_xmit_data xdptxd;
bool work_done = true;
bool flush = false;
xdpi.mode = MLX5E_XDP_XMIT_MODE_XSK;
for (; budget; budget--) {
int check_result = sq->xmit_xdp_frame_check(sq);
struct xdp_desc desc;
if (unlikely(check_result < 0)) {
work_done = false;
break;
}
if (!xsk_umem_consume_tx(umem, &desc)) {
/* TX will get stuck until something wakes it up by
* triggering NAPI. Currently it's expected that the
* application calls sendto() if there are consumed, but
* not completed frames.
*/
break;
}
xdptxd.dma_addr = xdp_umem_get_dma(umem, desc.addr);
xdptxd.data = xdp_umem_get_data(umem, desc.addr);
xdptxd.len = desc.len;
dma_sync_single_for_device(sq->pdev, xdptxd.dma_addr,
xdptxd.len, DMA_BIDIRECTIONAL);
if (unlikely(!sq->xmit_xdp_frame(sq, &xdptxd, &xdpi, check_result))) {
if (sq->mpwqe.wqe)
mlx5e_xdp_mpwqe_complete(sq);
mlx5e_xsk_tx_post_err(sq, &xdpi);
}
flush = true;
}
if (flush) {
if (sq->mpwqe.wqe)
mlx5e_xdp_mpwqe_complete(sq);
mlx5e_xmit_xdp_doorbell(sq);
xsk_umem_consume_tx_done(umem);
}
return !(budget && work_done);
}
/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
/* Copyright (c) 2019 Mellanox Technologies. */
#ifndef __MLX5_EN_XSK_TX_H__
#define __MLX5_EN_XSK_TX_H__
#include "en.h"
/* TX data path */
int mlx5e_xsk_async_xmit(struct net_device *dev, u32 qid);
bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget);
#endif /* __MLX5_EN_XSK_TX_H__ */
// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/* Copyright (c) 2019 Mellanox Technologies. */
#include <net/xdp_sock.h>
#include "umem.h"
#include "setup.h"
#include "en/params.h"
static int mlx5e_xsk_map_umem(struct mlx5e_priv *priv,
struct xdp_umem *umem)
{
struct device *dev = priv->mdev->device;
u32 i;
for (i = 0; i < umem->npgs; i++) {
dma_addr_t dma = dma_map_page(dev, umem->pgs[i], 0, PAGE_SIZE,
DMA_BIDIRECTIONAL);
if (unlikely(dma_mapping_error(dev, dma)))
goto err_unmap;
umem->pages[i].dma = dma;
}
return 0;
err_unmap:
while (i--) {
dma_unmap_page(dev, umem->pages[i].dma, PAGE_SIZE,
DMA_BIDIRECTIONAL);
umem->pages[i].dma = 0;
}
return -ENOMEM;
}
static void mlx5e_xsk_unmap_umem(struct mlx5e_priv *priv,
struct xdp_umem *umem)
{
struct device *dev = priv->mdev->device;
u32 i;
for (i = 0; i < umem->npgs; i++) {
dma_unmap_page(dev, umem->pages[i].dma, PAGE_SIZE,
DMA_BIDIRECTIONAL);
umem->pages[i].dma = 0;
}
}
static int mlx5e_xsk_get_umems(struct mlx5e_xsk *xsk)
{
if (!xsk->umems) {
xsk->umems = kcalloc(MLX5E_MAX_NUM_CHANNELS,
sizeof(*xsk->umems), GFP_KERNEL);
if (unlikely(!xsk->umems))
return -ENOMEM;
}
xsk->refcnt++;
xsk->ever_used = true;
return 0;
}
static void mlx5e_xsk_put_umems(struct mlx5e_xsk *xsk)
{
if (!--xsk->refcnt) {
kfree(xsk->umems);
xsk->umems = NULL;
}
}
static int mlx5e_xsk_add_umem(struct mlx5e_xsk *xsk, struct xdp_umem *umem, u16 ix)
{
int err;
err = mlx5e_xsk_get_umems(xsk);
if (unlikely(err))
return err;
xsk->umems[ix] = umem;
return 0;
}
static void mlx5e_xsk_remove_umem(struct mlx5e_xsk *xsk, u16 ix)
{
xsk->umems[ix] = NULL;
mlx5e_xsk_put_umems(xsk);
}
static bool mlx5e_xsk_is_umem_sane(struct xdp_umem *umem)
{
return umem->headroom <= 0xffff && umem->chunk_size_nohr <= 0xffff;
}
void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk)
{
xsk->headroom = umem->headroom;
xsk->chunk_size = umem->chunk_size_nohr + umem->headroom;
}
static int mlx5e_xsk_enable_locked(struct mlx5e_priv *priv,
struct xdp_umem *umem, u16 ix)
{
struct mlx5e_params *params = &priv->channels.params;
struct mlx5e_xsk_param xsk;
struct mlx5e_channel *c;
int err;
if (unlikely(mlx5e_xsk_get_umem(&priv->channels.params, &priv->xsk, ix)))
return -EBUSY;
if (unlikely(!mlx5e_xsk_is_umem_sane(umem)))
return -EINVAL;
err = mlx5e_xsk_map_umem(priv, umem);
if (unlikely(err))
return err;
err = mlx5e_xsk_add_umem(&priv->xsk, umem, ix);
if (unlikely(err))
goto err_unmap_umem;
mlx5e_build_xsk_param(umem, &xsk);
if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
/* XSK objects will be created on open. */
goto validate_closed;
}
if (!params->xdp_prog) {
/* XSK objects will be created when an XDP program is set,
* and the channels are reopened.
*/
goto validate_closed;
}
c = priv->channels.c[ix];
err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
if (unlikely(err))
goto err_remove_umem;
mlx5e_activate_xsk(c);
/* Don't wait for WQEs, because the newer xdpsock sample doesn't provide
* any Fill Ring entries at the setup stage.
*/
err = mlx5e_xsk_redirect_rqt_to_channel(priv, priv->channels.c[ix]);
if (unlikely(err))
goto err_deactivate;
return 0;
err_deactivate:
mlx5e_deactivate_xsk(c);
mlx5e_close_xsk(c);
err_remove_umem:
mlx5e_xsk_remove_umem(&priv->xsk, ix);
err_unmap_umem:
mlx5e_xsk_unmap_umem(priv, umem);
return err;
validate_closed:
/* Check the configuration in advance, rather than fail at a later stage
* (in mlx5e_xdp_set or on open) and end up with no channels.
*/
if (!mlx5e_validate_xsk_param(params, &xsk, priv->mdev)) {
err = -EINVAL;
goto err_remove_umem;
}
return 0;
}
static int mlx5e_xsk_disable_locked(struct mlx5e_priv *priv, u16 ix)
{
struct xdp_umem *umem = mlx5e_xsk_get_umem(&priv->channels.params,
&priv->xsk, ix);
struct mlx5e_channel *c;
if (unlikely(!umem))
return -EINVAL;
if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
goto remove_umem;
/* XSK RQ and SQ are only created if XDP program is set. */
if (!priv->channels.params.xdp_prog)
goto remove_umem;
c = priv->channels.c[ix];
mlx5e_xsk_redirect_rqt_to_drop(priv, ix);
mlx5e_deactivate_xsk(c);
mlx5e_close_xsk(c);
remove_umem:
mlx5e_xsk_remove_umem(&priv->xsk, ix);
mlx5e_xsk_unmap_umem(priv, umem);
return 0;
}
static int mlx5e_xsk_enable_umem(struct mlx5e_priv *priv, struct xdp_umem *umem,
u16 ix)
{
int err;
mutex_lock(&priv->state_lock);
err = mlx5e_xsk_enable_locked(priv, umem, ix);
mutex_unlock(&priv->state_lock);
return err;
}
static int mlx5e_xsk_disable_umem(struct mlx5e_priv *priv, u16 ix)
{
int err;
mutex_lock(&priv->state_lock);
err = mlx5e_xsk_disable_locked(priv, ix);
mutex_unlock(&priv->state_lock);
return err;
}
int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid)
{
struct mlx5e_priv *priv = netdev_priv(dev);
struct mlx5e_params *params = &priv->channels.params;
u16 ix;
if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid, MLX5E_RQ_GROUP_XSK, &ix)))
return -EINVAL;
return umem ? mlx5e_xsk_enable_umem(priv, umem, ix) :
mlx5e_xsk_disable_umem(priv, ix);
}
int mlx5e_xsk_resize_reuseq(struct xdp_umem *umem, u32 nentries)
{
struct xdp_umem_fq_reuse *reuseq;
reuseq = xsk_reuseq_prepare(nentries);
if (unlikely(!reuseq))
return -ENOMEM;
xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
return 0;
}
u16 mlx5e_xsk_first_unused_channel(struct mlx5e_params *params, struct mlx5e_xsk *xsk)
{
u16 res = xsk->refcnt ? params->num_channels : 0;
while (res) {
if (mlx5e_xsk_get_umem(params, xsk, res - 1))
break;
--res;
}
return res;
}
/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
/* Copyright (c) 2019 Mellanox Technologies. */
#ifndef __MLX5_EN_XSK_UMEM_H__
#define __MLX5_EN_XSK_UMEM_H__
#include "en.h"
static inline struct xdp_umem *mlx5e_xsk_get_umem(struct mlx5e_params *params,
struct mlx5e_xsk *xsk, u16 ix)
{
if (!xsk || !xsk->umems)
return NULL;
if (unlikely(ix >= params->num_channels))
return NULL;
return xsk->umems[ix];
}
struct mlx5e_xsk_param;
void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk);
/* .ndo_bpf callback. */
int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid);
int mlx5e_xsk_resize_reuseq(struct xdp_umem *umem, u32 nentries);
u16 mlx5e_xsk_first_unused_channel(struct mlx5e_params *params, struct mlx5e_xsk *xsk);
#endif /* __MLX5_EN_XSK_UMEM_H__ */
...@@ -32,6 +32,7 @@ ...@@ -32,6 +32,7 @@
#include "en.h" #include "en.h"
#include "en/port.h" #include "en/port.h"
#include "en/xsk/umem.h"
#include "lib/clock.h" #include "lib/clock.h"
void mlx5e_ethtool_get_drvinfo(struct mlx5e_priv *priv, void mlx5e_ethtool_get_drvinfo(struct mlx5e_priv *priv,
...@@ -388,8 +389,17 @@ static int mlx5e_set_ringparam(struct net_device *dev, ...@@ -388,8 +389,17 @@ static int mlx5e_set_ringparam(struct net_device *dev,
void mlx5e_ethtool_get_channels(struct mlx5e_priv *priv, void mlx5e_ethtool_get_channels(struct mlx5e_priv *priv,
struct ethtool_channels *ch) struct ethtool_channels *ch)
{ {
mutex_lock(&priv->state_lock);
ch->max_combined = mlx5e_get_netdev_max_channels(priv->netdev); ch->max_combined = mlx5e_get_netdev_max_channels(priv->netdev);
ch->combined_count = priv->channels.params.num_channels; ch->combined_count = priv->channels.params.num_channels;
if (priv->xsk.refcnt) {
/* The upper half are XSK queues. */
ch->max_combined *= 2;
ch->combined_count *= 2;
}
mutex_unlock(&priv->state_lock);
} }
static void mlx5e_get_channels(struct net_device *dev, static void mlx5e_get_channels(struct net_device *dev,
...@@ -403,6 +413,7 @@ static void mlx5e_get_channels(struct net_device *dev, ...@@ -403,6 +413,7 @@ static void mlx5e_get_channels(struct net_device *dev,
int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv, int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv,
struct ethtool_channels *ch) struct ethtool_channels *ch)
{ {
struct mlx5e_params *cur_params = &priv->channels.params;
unsigned int count = ch->combined_count; unsigned int count = ch->combined_count;
struct mlx5e_channels new_channels = {}; struct mlx5e_channels new_channels = {};
bool arfs_enabled; bool arfs_enabled;
...@@ -414,16 +425,26 @@ int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv, ...@@ -414,16 +425,26 @@ int mlx5e_ethtool_set_channels(struct mlx5e_priv *priv,
return -EINVAL; return -EINVAL;
} }
if (priv->channels.params.num_channels == count) if (cur_params->num_channels == count)
return 0; return 0;
mutex_lock(&priv->state_lock); mutex_lock(&priv->state_lock);
/* Don't allow changing the number of channels if there is an active
* XSK, because the numeration of the XSK and regular RQs will change.
*/
if (priv->xsk.refcnt) {
err = -EINVAL;
netdev_err(priv->netdev, "%s: AF_XDP is active, cannot change the number of channels\n",
__func__);
goto out;
}
new_channels.params = priv->channels.params; new_channels.params = priv->channels.params;
new_channels.params.num_channels = count; new_channels.params.num_channels = count;
if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) { if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
priv->channels.params = new_channels.params; *cur_params = new_channels.params;
if (!netif_is_rxfh_configured(priv->netdev)) if (!netif_is_rxfh_configured(priv->netdev))
mlx5e_build_default_indir_rqt(priv->rss_params.indirection_rqt, mlx5e_build_default_indir_rqt(priv->rss_params.indirection_rqt,
MLX5E_INDIR_RQT_SIZE, count); MLX5E_INDIR_RQT_SIZE, count);
......
...@@ -32,6 +32,8 @@ ...@@ -32,6 +32,8 @@
#include <linux/mlx5/fs.h> #include <linux/mlx5/fs.h>
#include "en.h" #include "en.h"
#include "en/params.h"
#include "en/xsk/umem.h"
struct mlx5e_ethtool_rule { struct mlx5e_ethtool_rule {
struct list_head list; struct list_head list;
...@@ -414,6 +416,14 @@ add_ethtool_flow_rule(struct mlx5e_priv *priv, ...@@ -414,6 +416,14 @@ add_ethtool_flow_rule(struct mlx5e_priv *priv,
if (fs->ring_cookie == RX_CLS_FLOW_DISC) { if (fs->ring_cookie == RX_CLS_FLOW_DISC) {
flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP; flow_act.action = MLX5_FLOW_CONTEXT_ACTION_DROP;
} else { } else {
struct mlx5e_params *params = &priv->channels.params;
enum mlx5e_rq_group group;
struct mlx5e_tir *tir;
u16 ix;
mlx5e_qid_get_ch_and_group(params, fs->ring_cookie, &ix, &group);
tir = group == MLX5E_RQ_GROUP_XSK ? priv->xsk_tir : priv->direct_tir;
dst = kzalloc(sizeof(*dst), GFP_KERNEL); dst = kzalloc(sizeof(*dst), GFP_KERNEL);
if (!dst) { if (!dst) {
err = -ENOMEM; err = -ENOMEM;
...@@ -421,7 +431,7 @@ add_ethtool_flow_rule(struct mlx5e_priv *priv, ...@@ -421,7 +431,7 @@ add_ethtool_flow_rule(struct mlx5e_priv *priv,
} }
dst->type = MLX5_FLOW_DESTINATION_TYPE_TIR; dst->type = MLX5_FLOW_DESTINATION_TYPE_TIR;
dst->tir_num = priv->direct_tir[fs->ring_cookie].tirn; dst->tir_num = tir[ix].tirn;
flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST; flow_act.action = MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
} }
...@@ -600,8 +610,8 @@ static int validate_flow(struct mlx5e_priv *priv, ...@@ -600,8 +610,8 @@ static int validate_flow(struct mlx5e_priv *priv,
if (fs->location >= MAX_NUM_OF_ETHTOOL_RULES) if (fs->location >= MAX_NUM_OF_ETHTOOL_RULES)
return -ENOSPC; return -ENOSPC;
if (fs->ring_cookie >= priv->channels.params.num_channels && if (fs->ring_cookie != RX_CLS_FLOW_DISC)
fs->ring_cookie != RX_CLS_FLOW_DISC) if (!mlx5e_qid_validate(&priv->channels.params, fs->ring_cookie))
return -EINVAL; return -EINVAL;
switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) { switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
......
...@@ -1510,7 +1510,7 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv) ...@@ -1510,7 +1510,7 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
if (err) if (err)
goto err_close_drop_rq; goto err_close_drop_rq;
err = mlx5e_create_direct_rqts(priv); err = mlx5e_create_direct_rqts(priv, priv->direct_tir);
if (err) if (err)
goto err_destroy_indirect_rqts; goto err_destroy_indirect_rqts;
...@@ -1518,7 +1518,7 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv) ...@@ -1518,7 +1518,7 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
if (err) if (err)
goto err_destroy_direct_rqts; goto err_destroy_direct_rqts;
err = mlx5e_create_direct_tirs(priv); err = mlx5e_create_direct_tirs(priv, priv->direct_tir);
if (err) if (err)
goto err_destroy_indirect_tirs; goto err_destroy_indirect_tirs;
...@@ -1535,11 +1535,11 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv) ...@@ -1535,11 +1535,11 @@ static int mlx5e_init_rep_rx(struct mlx5e_priv *priv)
err_destroy_ttc_table: err_destroy_ttc_table:
mlx5e_destroy_ttc_table(priv, &priv->fs.ttc); mlx5e_destroy_ttc_table(priv, &priv->fs.ttc);
err_destroy_direct_tirs: err_destroy_direct_tirs:
mlx5e_destroy_direct_tirs(priv); mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
err_destroy_indirect_tirs: err_destroy_indirect_tirs:
mlx5e_destroy_indirect_tirs(priv, false); mlx5e_destroy_indirect_tirs(priv, false);
err_destroy_direct_rqts: err_destroy_direct_rqts:
mlx5e_destroy_direct_rqts(priv); mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
err_destroy_indirect_rqts: err_destroy_indirect_rqts:
mlx5e_destroy_rqt(priv, &priv->indir_rqt); mlx5e_destroy_rqt(priv, &priv->indir_rqt);
err_close_drop_rq: err_close_drop_rq:
...@@ -1553,9 +1553,9 @@ static void mlx5e_cleanup_rep_rx(struct mlx5e_priv *priv) ...@@ -1553,9 +1553,9 @@ static void mlx5e_cleanup_rep_rx(struct mlx5e_priv *priv)
mlx5_del_flow_rules(rpriv->vport_rx_rule); mlx5_del_flow_rules(rpriv->vport_rx_rule);
mlx5e_destroy_ttc_table(priv, &priv->fs.ttc); mlx5e_destroy_ttc_table(priv, &priv->fs.ttc);
mlx5e_destroy_direct_tirs(priv); mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
mlx5e_destroy_indirect_tirs(priv, false); mlx5e_destroy_indirect_tirs(priv, false);
mlx5e_destroy_direct_rqts(priv); mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
mlx5e_destroy_rqt(priv, &priv->indir_rqt); mlx5e_destroy_rqt(priv, &priv->indir_rqt);
mlx5e_close_drop_rq(&priv->drop_rq); mlx5e_close_drop_rq(&priv->drop_rq);
} }
......
...@@ -47,6 +47,7 @@ ...@@ -47,6 +47,7 @@
#include "en_accel/tls_rxtx.h" #include "en_accel/tls_rxtx.h"
#include "lib/clock.h" #include "lib/clock.h"
#include "en/xdp.h" #include "en/xdp.h"
#include "en/xsk/rx.h"
static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config) static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
{ {
...@@ -235,7 +236,7 @@ static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq, ...@@ -235,7 +236,7 @@ static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
return true; return true;
} }
static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq, static inline int mlx5e_page_alloc_pool(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info) struct mlx5e_dma_info *dma_info)
{ {
if (mlx5e_rx_cache_get(rq, dma_info)) if (mlx5e_rx_cache_get(rq, dma_info))
...@@ -256,12 +257,22 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq, ...@@ -256,12 +257,22 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
return 0; return 0;
} }
static inline int mlx5e_page_alloc(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info)
{
if (rq->umem)
return mlx5e_xsk_page_alloc_umem(rq, dma_info);
else
return mlx5e_page_alloc_pool(rq, dma_info);
}
void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info) void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
{ {
dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, rq->buff.map_dir); dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, rq->buff.map_dir);
} }
void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info, void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info,
bool recycle) bool recycle)
{ {
if (likely(recycle)) { if (likely(recycle)) {
...@@ -277,6 +288,20 @@ void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info, ...@@ -277,6 +288,20 @@ void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
} }
} }
static inline void mlx5e_page_release(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info,
bool recycle)
{
if (rq->umem)
/* The `recycle` parameter is ignored, and the page is always
* put into the Reuse Ring, because there is no way to return
* the page to the userspace when the interface goes down.
*/
mlx5e_xsk_page_release(rq, dma_info);
else
mlx5e_page_release_dynamic(rq, dma_info, recycle);
}
static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq, static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
struct mlx5e_wqe_frag_info *frag) struct mlx5e_wqe_frag_info *frag)
{ {
...@@ -288,7 +313,7 @@ static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq, ...@@ -288,7 +313,7 @@ static inline int mlx5e_get_rx_frag(struct mlx5e_rq *rq,
* offset) should just use the new one without replenishing again * offset) should just use the new one without replenishing again
* by themselves. * by themselves.
*/ */
err = mlx5e_page_alloc_mapped(rq, frag->di); err = mlx5e_page_alloc(rq, frag->di);
return err; return err;
} }
...@@ -354,6 +379,13 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk) ...@@ -354,6 +379,13 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk)
int err; int err;
int i; int i;
if (rq->umem) {
int pages_desired = wqe_bulk << rq->wqe.info.log_num_frags;
if (unlikely(!mlx5e_xsk_pages_enough_umem(rq, pages_desired)))
return -ENOMEM;
}
for (i = 0; i < wqe_bulk; i++) { for (i = 0; i < wqe_bulk; i++) {
struct mlx5e_rx_wqe_cyc *wqe = mlx5_wq_cyc_get_wqe(wq, ix + i); struct mlx5e_rx_wqe_cyc *wqe = mlx5_wq_cyc_get_wqe(wq, ix + i);
...@@ -401,11 +433,17 @@ mlx5e_copy_skb_header(struct device *pdev, struct sk_buff *skb, ...@@ -401,11 +433,17 @@ mlx5e_copy_skb_header(struct device *pdev, struct sk_buff *skb,
static void static void
mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, bool recycle) mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, bool recycle)
{ {
const bool no_xdp_xmit = bool no_xdp_xmit;
bitmap_empty(wi->xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE);
struct mlx5e_dma_info *dma_info = wi->umr.dma_info; struct mlx5e_dma_info *dma_info = wi->umr.dma_info;
int i; int i;
/* A common case for AF_XDP. */
if (bitmap_full(wi->xdp_xmit_bitmap, MLX5_MPWRQ_PAGES_PER_WQE))
return;
no_xdp_xmit = bitmap_empty(wi->xdp_xmit_bitmap,
MLX5_MPWRQ_PAGES_PER_WQE);
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++)
if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap)) if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
mlx5e_page_release(rq, &dma_info[i], recycle); mlx5e_page_release(rq, &dma_info[i], recycle);
...@@ -427,11 +465,6 @@ static void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq, u8 n) ...@@ -427,11 +465,6 @@ static void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq, u8 n)
mlx5_wq_ll_update_db_record(wq); mlx5_wq_ll_update_db_record(wq);
} }
static inline u16 mlx5e_icosq_wrap_cnt(struct mlx5e_icosq *sq)
{
return mlx5_wq_cyc_get_ctr_wrap_cnt(&sq->wq, sq->pc);
}
static inline void mlx5e_fill_icosq_frag_edge(struct mlx5e_icosq *sq, static inline void mlx5e_fill_icosq_frag_edge(struct mlx5e_icosq *sq,
struct mlx5_wq_cyc *wq, struct mlx5_wq_cyc *wq,
u16 pi, u16 nnops) u16 pi, u16 nnops)
...@@ -459,6 +492,12 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix) ...@@ -459,6 +492,12 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
int err; int err;
int i; int i;
if (rq->umem &&
unlikely(!mlx5e_xsk_pages_enough_umem(rq, MLX5_MPWRQ_PAGES_PER_WQE))) {
err = -ENOMEM;
goto err;
}
pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc); pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi); contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
if (unlikely(contig_wqebbs_room < MLX5E_UMR_WQEBBS)) { if (unlikely(contig_wqebbs_room < MLX5E_UMR_WQEBBS)) {
...@@ -467,12 +506,10 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix) ...@@ -467,12 +506,10 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
} }
umr_wqe = mlx5_wq_cyc_get_wqe(wq, pi); umr_wqe = mlx5_wq_cyc_get_wqe(wq, pi);
if (unlikely(mlx5e_icosq_wrap_cnt(sq) < 2)) memcpy(umr_wqe, &rq->mpwqe.umr_wqe, offsetof(struct mlx5e_umr_wqe, inline_mtts));
memcpy(umr_wqe, &rq->mpwqe.umr_wqe,
offsetof(struct mlx5e_umr_wqe, inline_mtts));
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++, dma_info++) { for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++, dma_info++) {
err = mlx5e_page_alloc_mapped(rq, dma_info); err = mlx5e_page_alloc(rq, dma_info);
if (unlikely(err)) if (unlikely(err))
goto err_unmap; goto err_unmap;
umr_wqe->inline_mtts[i].ptag = cpu_to_be64(dma_info->addr | MLX5_EN_WR); umr_wqe->inline_mtts[i].ptag = cpu_to_be64(dma_info->addr | MLX5_EN_WR);
...@@ -487,6 +524,7 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix) ...@@ -487,6 +524,7 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
umr_wqe->uctrl.xlt_offset = cpu_to_be16(xlt_offset); umr_wqe->uctrl.xlt_offset = cpu_to_be16(xlt_offset);
sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR; sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR;
sq->db.ico_wqe[pi].umr.rq = rq;
sq->pc += MLX5E_UMR_WQEBBS; sq->pc += MLX5E_UMR_WQEBBS;
sq->doorbell_cseg = &umr_wqe->ctrl; sq->doorbell_cseg = &umr_wqe->ctrl;
...@@ -498,6 +536,8 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix) ...@@ -498,6 +536,8 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
dma_info--; dma_info--;
mlx5e_page_release(rq, dma_info, true); mlx5e_page_release(rq, dma_info, true);
} }
err:
rq->stats->buff_alloc_err++; rq->stats->buff_alloc_err++;
return err; return err;
...@@ -544,11 +584,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq) ...@@ -544,11 +584,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
return !!err; return !!err;
} }
static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq) void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
{ {
struct mlx5e_icosq *sq = container_of(cq, struct mlx5e_icosq, cq); struct mlx5e_icosq *sq = container_of(cq, struct mlx5e_icosq, cq);
struct mlx5_cqe64 *cqe; struct mlx5_cqe64 *cqe;
u8 completed_umr = 0;
u16 sqcc; u16 sqcc;
int i; int i;
...@@ -589,7 +628,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq) ...@@ -589,7 +628,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
if (likely(wi->opcode == MLX5_OPCODE_UMR)) { if (likely(wi->opcode == MLX5_OPCODE_UMR)) {
sqcc += MLX5E_UMR_WQEBBS; sqcc += MLX5E_UMR_WQEBBS;
completed_umr++; wi->umr.rq->mpwqe.umr_completed++;
} else if (likely(wi->opcode == MLX5_OPCODE_NOP)) { } else if (likely(wi->opcode == MLX5_OPCODE_NOP)) {
sqcc++; sqcc++;
} else { } else {
...@@ -605,24 +644,25 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq) ...@@ -605,24 +644,25 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
sq->cc = sqcc; sq->cc = sqcc;
mlx5_cqwq_update_db_record(&cq->wq); mlx5_cqwq_update_db_record(&cq->wq);
if (likely(completed_umr)) {
mlx5e_post_rx_mpwqe(rq, completed_umr);
rq->mpwqe.umr_in_progress -= completed_umr;
}
} }
bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq) bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
{ {
struct mlx5e_icosq *sq = &rq->channel->icosq; struct mlx5e_icosq *sq = &rq->channel->icosq;
struct mlx5_wq_ll *wq = &rq->mpwqe.wq; struct mlx5_wq_ll *wq = &rq->mpwqe.wq;
u8 umr_completed = rq->mpwqe.umr_completed;
int alloc_err = 0;
u8 missing, i; u8 missing, i;
u16 head; u16 head;
if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state))) if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state)))
return false; return false;
mlx5e_poll_ico_cq(&sq->cq, rq); if (umr_completed) {
mlx5e_post_rx_mpwqe(rq, umr_completed);
rq->mpwqe.umr_in_progress -= umr_completed;
rq->mpwqe.umr_completed = 0;
}
missing = mlx5_wq_ll_missing(wq) - rq->mpwqe.umr_in_progress; missing = mlx5_wq_ll_missing(wq) - rq->mpwqe.umr_in_progress;
...@@ -636,7 +676,9 @@ bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq) ...@@ -636,7 +676,9 @@ bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
head = rq->mpwqe.actual_wq_head; head = rq->mpwqe.actual_wq_head;
i = missing; i = missing;
do { do {
if (unlikely(mlx5e_alloc_rx_mpwqe(rq, head))) alloc_err = mlx5e_alloc_rx_mpwqe(rq, head);
if (unlikely(alloc_err))
break; break;
head = mlx5_wq_ll_get_wqe_next_ix(wq, head); head = mlx5_wq_ll_get_wqe_next_ix(wq, head);
} while (--i); } while (--i);
...@@ -650,6 +692,12 @@ bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq) ...@@ -650,6 +692,12 @@ bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
rq->mpwqe.umr_in_progress += rq->mpwqe.umr_last_bulk; rq->mpwqe.umr_in_progress += rq->mpwqe.umr_last_bulk;
rq->mpwqe.actual_wq_head = head; rq->mpwqe.actual_wq_head = head;
/* If XSK Fill Ring doesn't have enough frames, busy poll by
* rescheduling the NAPI poll.
*/
if (unlikely(alloc_err == -ENOMEM && rq->umem))
return true;
return false; return false;
} }
...@@ -1018,7 +1066,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe, ...@@ -1018,7 +1066,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
} }
rcu_read_lock(); rcu_read_lock();
consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt); consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, false);
rcu_read_unlock(); rcu_read_unlock();
if (consumed) if (consumed)
return NULL; /* page/packet was consumed by XDP */ return NULL; /* page/packet was consumed by XDP */
...@@ -1235,7 +1283,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, ...@@ -1235,7 +1283,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
prefetch(data); prefetch(data);
rcu_read_lock(); rcu_read_lock();
consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32); consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, false);
rcu_read_unlock(); rcu_read_unlock();
if (consumed) { if (consumed) {
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
......
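The en_rx.c changes above make buffer handling dispatch on rq->umem: an XSK-backed RQ pulls pages from the UMEM Fill Ring and returns them to the Reuse Ring, and when the Fill Ring runs dry the RQ reports itself busy so NAPI keeps polling until user space catches up. Keeping that Fill Ring populated is the application's job; as a rough illustration only, a user-space refill loop built on libbpf's xsk.h helpers (header path and helper names assumed from that header, frame bookkeeping simplified) could look like:

  #include <linux/types.h>
  #include <bpf/xsk.h>

  /* Reserve slots in the Fill Ring, hand back free UMEM frame addresses and
   * publish them so the driver can post fresh RX descriptors. */
  static void refill_fill_ring(struct xsk_ring_prod *fq,
                               const __u64 *free_frames, unsigned int nb)
  {
          __u32 idx;
          unsigned int i, n;

          n = xsk_ring_prod__reserve(fq, nb, &idx);
          for (i = 0; i < n; i++)
                  *xsk_ring_prod__fill_addr(fq, idx + i) = free_frames[i];
          if (n)
                  xsk_ring_prod__submit(fq, n);
  }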
...@@ -46,6 +46,8 @@ ...@@ -46,6 +46,8 @@
#define MLX5E_DECLARE_TX_STAT(type, fld) "tx%d_"#fld, offsetof(type, fld) #define MLX5E_DECLARE_TX_STAT(type, fld) "tx%d_"#fld, offsetof(type, fld)
#define MLX5E_DECLARE_XDPSQ_STAT(type, fld) "tx%d_xdp_"#fld, offsetof(type, fld) #define MLX5E_DECLARE_XDPSQ_STAT(type, fld) "tx%d_xdp_"#fld, offsetof(type, fld)
#define MLX5E_DECLARE_RQ_XDPSQ_STAT(type, fld) "rx%d_xdp_tx_"#fld, offsetof(type, fld) #define MLX5E_DECLARE_RQ_XDPSQ_STAT(type, fld) "rx%d_xdp_tx_"#fld, offsetof(type, fld)
#define MLX5E_DECLARE_XSKRQ_STAT(type, fld) "rx%d_xsk_"#fld, offsetof(type, fld)
#define MLX5E_DECLARE_XSKSQ_STAT(type, fld) "tx%d_xsk_"#fld, offsetof(type, fld)
#define MLX5E_DECLARE_CH_STAT(type, fld) "ch%d_"#fld, offsetof(type, fld) #define MLX5E_DECLARE_CH_STAT(type, fld) "ch%d_"#fld, offsetof(type, fld)
struct counter_desc { struct counter_desc {
...@@ -116,12 +118,39 @@ struct mlx5e_sw_stats { ...@@ -116,12 +118,39 @@ struct mlx5e_sw_stats {
u64 ch_poll; u64 ch_poll;
u64 ch_arm; u64 ch_arm;
u64 ch_aff_change; u64 ch_aff_change;
u64 ch_force_irq;
u64 ch_eq_rearm; u64 ch_eq_rearm;
#ifdef CONFIG_MLX5_EN_TLS #ifdef CONFIG_MLX5_EN_TLS
u64 tx_tls_ooo; u64 tx_tls_ooo;
u64 tx_tls_resync_bytes; u64 tx_tls_resync_bytes;
#endif #endif
u64 rx_xsk_packets;
u64 rx_xsk_bytes;
u64 rx_xsk_csum_complete;
u64 rx_xsk_csum_unnecessary;
u64 rx_xsk_csum_unnecessary_inner;
u64 rx_xsk_csum_none;
u64 rx_xsk_ecn_mark;
u64 rx_xsk_removed_vlan_packets;
u64 rx_xsk_xdp_drop;
u64 rx_xsk_xdp_redirect;
u64 rx_xsk_wqe_err;
u64 rx_xsk_mpwqe_filler_cqes;
u64 rx_xsk_mpwqe_filler_strides;
u64 rx_xsk_oversize_pkts_sw_drop;
u64 rx_xsk_buff_alloc_err;
u64 rx_xsk_cqe_compress_blks;
u64 rx_xsk_cqe_compress_pkts;
u64 rx_xsk_congst_umr;
u64 rx_xsk_arfs_err;
u64 tx_xsk_xmit;
u64 tx_xsk_mpwqe;
u64 tx_xsk_inlnw;
u64 tx_xsk_full;
u64 tx_xsk_err;
u64 tx_xsk_cqes;
}; };
struct mlx5e_qcounter_stats { struct mlx5e_qcounter_stats {
...@@ -256,6 +285,7 @@ struct mlx5e_ch_stats { ...@@ -256,6 +285,7 @@ struct mlx5e_ch_stats {
u64 poll; u64 poll;
u64 arm; u64 arm;
u64 aff_change; u64 aff_change;
u64 force_irq;
u64 eq_rearm; u64 eq_rearm;
}; };
......
...@@ -33,6 +33,7 @@ ...@@ -33,6 +33,7 @@
#include <linux/irq.h> #include <linux/irq.h>
#include "en.h" #include "en.h"
#include "en/xdp.h" #include "en/xdp.h"
#include "en/xsk/tx.h"
static inline bool mlx5e_channel_no_affinity_change(struct mlx5e_channel *c) static inline bool mlx5e_channel_no_affinity_change(struct mlx5e_channel *c)
{ {
...@@ -85,7 +86,12 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget) ...@@ -85,7 +86,12 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel, struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel,
napi); napi);
struct mlx5e_ch_stats *ch_stats = c->stats; struct mlx5e_ch_stats *ch_stats = c->stats;
struct mlx5e_xdpsq *xsksq = &c->xsksq;
struct mlx5e_rq *xskrq = &c->xskrq;
struct mlx5e_rq *rq = &c->rq; struct mlx5e_rq *rq = &c->rq;
bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
bool aff_change = false;
bool busy_xsk = false;
bool busy = false; bool busy = false;
int work_done = 0; int work_done = 0;
int i; int i;
...@@ -95,22 +101,38 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget) ...@@ -95,22 +101,38 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
for (i = 0; i < c->num_tc; i++) for (i = 0; i < c->num_tc; i++)
busy |= mlx5e_poll_tx_cq(&c->sq[i].cq, budget); busy |= mlx5e_poll_tx_cq(&c->sq[i].cq, budget);
busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq, NULL); busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq);
if (c->xdp) if (c->xdp)
busy |= mlx5e_poll_xdpsq_cq(&rq->xdpsq.cq, rq); busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq);
if (likely(budget)) { /* budget=0 means: don't poll rx rings */ if (likely(budget)) { /* budget=0 means: don't poll rx rings */
work_done = mlx5e_poll_rx_cq(&rq->cq, budget); if (xsk_open)
work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget);
if (likely(budget - work_done))
work_done += mlx5e_poll_rx_cq(&rq->cq, budget - work_done);
busy |= work_done == budget; busy |= work_done == budget;
} }
busy |= c->rq.post_wqes(rq); mlx5e_poll_ico_cq(&c->icosq.cq);
busy |= rq->post_wqes(rq);
if (xsk_open) {
mlx5e_poll_ico_cq(&c->xskicosq.cq);
busy |= mlx5e_poll_xdpsq_cq(&xsksq->cq);
busy_xsk |= mlx5e_xsk_tx(xsksq, MLX5E_TX_XSK_POLL_BUDGET);
busy_xsk |= xskrq->post_wqes(xskrq);
}
busy |= busy_xsk;
if (busy) { if (busy) {
if (likely(mlx5e_channel_no_affinity_change(c))) if (likely(mlx5e_channel_no_affinity_change(c)))
return budget; return budget;
ch_stats->aff_change++; ch_stats->aff_change++;
aff_change = true;
if (budget && work_done == budget) if (budget && work_done == budget)
work_done--; work_done--;
} }
...@@ -131,6 +153,18 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget) ...@@ -131,6 +153,18 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
mlx5e_cq_arm(&c->icosq.cq); mlx5e_cq_arm(&c->icosq.cq);
mlx5e_cq_arm(&c->xdpsq.cq); mlx5e_cq_arm(&c->xdpsq.cq);
if (xsk_open) {
mlx5e_handle_rx_dim(xskrq);
mlx5e_cq_arm(&c->xskicosq.cq);
mlx5e_cq_arm(&xsksq->cq);
mlx5e_cq_arm(&xskrq->cq);
}
if (unlikely(aff_change && busy_xsk)) {
mlx5e_trigger_irq(&c->icosq);
ch_stats->force_irq++;
}
return work_done; return work_done;
} }
......
...@@ -87,7 +87,7 @@ int mlx5i_init(struct mlx5_core_dev *mdev, ...@@ -87,7 +87,7 @@ int mlx5i_init(struct mlx5_core_dev *mdev,
mlx5e_set_netdev_mtu_boundaries(priv); mlx5e_set_netdev_mtu_boundaries(priv);
netdev->mtu = netdev->max_mtu; netdev->mtu = netdev->max_mtu;
mlx5e_build_nic_params(mdev, &priv->rss_params, &priv->channels.params, mlx5e_build_nic_params(mdev, NULL, &priv->rss_params, &priv->channels.params,
mlx5e_get_netdev_max_channels(netdev), mlx5e_get_netdev_max_channels(netdev),
netdev->mtu); netdev->mtu);
mlx5i_build_nic_params(mdev, &priv->channels.params); mlx5i_build_nic_params(mdev, &priv->channels.params);
...@@ -365,7 +365,7 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv) ...@@ -365,7 +365,7 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv)
if (err) if (err)
goto err_close_drop_rq; goto err_close_drop_rq;
err = mlx5e_create_direct_rqts(priv); err = mlx5e_create_direct_rqts(priv, priv->direct_tir);
if (err) if (err)
goto err_destroy_indirect_rqts; goto err_destroy_indirect_rqts;
...@@ -373,7 +373,7 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv) ...@@ -373,7 +373,7 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv)
if (err) if (err)
goto err_destroy_direct_rqts; goto err_destroy_direct_rqts;
err = mlx5e_create_direct_tirs(priv); err = mlx5e_create_direct_tirs(priv, priv->direct_tir);
if (err) if (err)
goto err_destroy_indirect_tirs; goto err_destroy_indirect_tirs;
...@@ -384,11 +384,11 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv) ...@@ -384,11 +384,11 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv)
return 0; return 0;
err_destroy_direct_tirs: err_destroy_direct_tirs:
mlx5e_destroy_direct_tirs(priv); mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
err_destroy_indirect_tirs: err_destroy_indirect_tirs:
mlx5e_destroy_indirect_tirs(priv, true); mlx5e_destroy_indirect_tirs(priv, true);
err_destroy_direct_rqts: err_destroy_direct_rqts:
mlx5e_destroy_direct_rqts(priv); mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
err_destroy_indirect_rqts: err_destroy_indirect_rqts:
mlx5e_destroy_rqt(priv, &priv->indir_rqt); mlx5e_destroy_rqt(priv, &priv->indir_rqt);
err_close_drop_rq: err_close_drop_rq:
...@@ -401,9 +401,9 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv) ...@@ -401,9 +401,9 @@ static int mlx5i_init_rx(struct mlx5e_priv *priv)
static void mlx5i_cleanup_rx(struct mlx5e_priv *priv) static void mlx5i_cleanup_rx(struct mlx5e_priv *priv)
{ {
mlx5i_destroy_flow_steering(priv); mlx5i_destroy_flow_steering(priv);
mlx5e_destroy_direct_tirs(priv); mlx5e_destroy_direct_tirs(priv, priv->direct_tir);
mlx5e_destroy_indirect_tirs(priv, true); mlx5e_destroy_indirect_tirs(priv, true);
mlx5e_destroy_direct_rqts(priv); mlx5e_destroy_direct_rqts(priv, priv->direct_tir);
mlx5e_destroy_rqt(priv, &priv->indir_rqt); mlx5e_destroy_rqt(priv, &priv->indir_rqt);
mlx5e_close_drop_rq(&priv->drop_rq); mlx5e_close_drop_rq(&priv->drop_rq);
mlx5e_destroy_q_counters(priv); mlx5e_destroy_q_counters(priv);
......
...@@ -134,11 +134,6 @@ static inline void mlx5_wq_cyc_update_db_record(struct mlx5_wq_cyc *wq) ...@@ -134,11 +134,6 @@ static inline void mlx5_wq_cyc_update_db_record(struct mlx5_wq_cyc *wq)
*wq->db = cpu_to_be32(wq->wqe_ctr); *wq->db = cpu_to_be32(wq->wqe_ctr);
} }
static inline u16 mlx5_wq_cyc_get_ctr_wrap_cnt(struct mlx5_wq_cyc *wq, u16 ctr)
{
return ctr >> wq->fbc.log_sz;
}
static inline u16 mlx5_wq_cyc_ctr2ix(struct mlx5_wq_cyc *wq, u16 ctr) static inline u16 mlx5_wq_cyc_ctr2ix(struct mlx5_wq_cyc *wq, u16 ctr)
{ {
return ctr & wq->fbc.sz_m1; return ctr & wq->fbc.sz_m1;
......
...@@ -38,6 +38,8 @@ ...@@ -38,6 +38,8 @@
#define VETH_XDP_TX BIT(0) #define VETH_XDP_TX BIT(0)
#define VETH_XDP_REDIR BIT(1) #define VETH_XDP_REDIR BIT(1)
#define VETH_XDP_TX_BULK_SIZE 16
struct veth_rq_stats { struct veth_rq_stats {
u64 xdp_packets; u64 xdp_packets;
u64 xdp_bytes; u64 xdp_bytes;
...@@ -64,6 +66,11 @@ struct veth_priv { ...@@ -64,6 +66,11 @@ struct veth_priv {
unsigned int requested_headroom; unsigned int requested_headroom;
}; };
struct veth_xdp_tx_bq {
struct xdp_frame *q[VETH_XDP_TX_BULK_SIZE];
unsigned int count;
};
/* /*
* ethtool interface * ethtool interface
*/ */
...@@ -442,13 +449,30 @@ static int veth_xdp_xmit(struct net_device *dev, int n, ...@@ -442,13 +449,30 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
return ret; return ret;
} }
static void veth_xdp_flush(struct net_device *dev) static void veth_xdp_flush_bq(struct net_device *dev, struct veth_xdp_tx_bq *bq)
{
int sent, i, err = 0;
sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
if (sent < 0) {
err = sent;
sent = 0;
for (i = 0; i < bq->count; i++)
xdp_return_frame(bq->q[i]);
}
trace_xdp_bulk_tx(dev, sent, bq->count - sent, err);
bq->count = 0;
}
static void veth_xdp_flush(struct net_device *dev, struct veth_xdp_tx_bq *bq)
{ {
struct veth_priv *rcv_priv, *priv = netdev_priv(dev); struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
struct net_device *rcv; struct net_device *rcv;
struct veth_rq *rq; struct veth_rq *rq;
rcu_read_lock(); rcu_read_lock();
veth_xdp_flush_bq(dev, bq);
rcv = rcu_dereference(priv->peer); rcv = rcu_dereference(priv->peer);
if (unlikely(!rcv)) if (unlikely(!rcv))
goto out; goto out;
...@@ -464,19 +488,26 @@ static void veth_xdp_flush(struct net_device *dev) ...@@ -464,19 +488,26 @@ static void veth_xdp_flush(struct net_device *dev)
rcu_read_unlock(); rcu_read_unlock();
} }
static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp) static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp,
struct veth_xdp_tx_bq *bq)
{ {
struct xdp_frame *frame = convert_to_xdp_frame(xdp); struct xdp_frame *frame = convert_to_xdp_frame(xdp);
if (unlikely(!frame)) if (unlikely(!frame))
return -EOVERFLOW; return -EOVERFLOW;
return veth_xdp_xmit(dev, 1, &frame, 0); if (unlikely(bq->count == VETH_XDP_TX_BULK_SIZE))
veth_xdp_flush_bq(dev, bq);
bq->q[bq->count++] = frame;
return 0;
} }
static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
struct xdp_frame *frame, struct xdp_frame *frame,
unsigned int *xdp_xmit) unsigned int *xdp_xmit,
struct veth_xdp_tx_bq *bq)
{ {
void *hard_start = frame->data - frame->headroom; void *hard_start = frame->data - frame->headroom;
void *head = hard_start - sizeof(struct xdp_frame); void *head = hard_start - sizeof(struct xdp_frame);
...@@ -509,7 +540,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, ...@@ -509,7 +540,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
orig_frame = *frame; orig_frame = *frame;
xdp.data_hard_start = head; xdp.data_hard_start = head;
xdp.rxq->mem = frame->mem; xdp.rxq->mem = frame->mem;
if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) { if (unlikely(veth_xdp_tx(rq->dev, &xdp, bq) < 0)) {
trace_xdp_exception(rq->dev, xdp_prog, act); trace_xdp_exception(rq->dev, xdp_prog, act);
frame = &orig_frame; frame = &orig_frame;
goto err_xdp; goto err_xdp;
...@@ -560,7 +591,8 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq, ...@@ -560,7 +591,8 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
} }
static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
unsigned int *xdp_xmit) unsigned int *xdp_xmit,
struct veth_xdp_tx_bq *bq)
{ {
u32 pktlen, headroom, act, metalen; u32 pktlen, headroom, act, metalen;
void *orig_data, *orig_data_end; void *orig_data, *orig_data_end;
...@@ -636,7 +668,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, ...@@ -636,7 +668,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
get_page(virt_to_page(xdp.data)); get_page(virt_to_page(xdp.data));
consume_skb(skb); consume_skb(skb);
xdp.rxq->mem = rq->xdp_mem; xdp.rxq->mem = rq->xdp_mem;
if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) { if (unlikely(veth_xdp_tx(rq->dev, &xdp, bq) < 0)) {
trace_xdp_exception(rq->dev, xdp_prog, act); trace_xdp_exception(rq->dev, xdp_prog, act);
goto err_xdp; goto err_xdp;
} }
...@@ -691,7 +723,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, ...@@ -691,7 +723,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
return NULL; return NULL;
} }
static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit) static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit,
struct veth_xdp_tx_bq *bq)
{ {
int i, done = 0, drops = 0, bytes = 0; int i, done = 0, drops = 0, bytes = 0;
...@@ -707,11 +740,11 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit) ...@@ -707,11 +740,11 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
struct xdp_frame *frame = veth_ptr_to_xdp(ptr); struct xdp_frame *frame = veth_ptr_to_xdp(ptr);
bytes += frame->len; bytes += frame->len;
skb = veth_xdp_rcv_one(rq, frame, &xdp_xmit_one); skb = veth_xdp_rcv_one(rq, frame, &xdp_xmit_one, bq);
} else { } else {
skb = ptr; skb = ptr;
bytes += skb->len; bytes += skb->len;
skb = veth_xdp_rcv_skb(rq, skb, &xdp_xmit_one); skb = veth_xdp_rcv_skb(rq, skb, &xdp_xmit_one, bq);
} }
*xdp_xmit |= xdp_xmit_one; *xdp_xmit |= xdp_xmit_one;
...@@ -737,10 +770,13 @@ static int veth_poll(struct napi_struct *napi, int budget) ...@@ -737,10 +770,13 @@ static int veth_poll(struct napi_struct *napi, int budget)
struct veth_rq *rq = struct veth_rq *rq =
container_of(napi, struct veth_rq, xdp_napi); container_of(napi, struct veth_rq, xdp_napi);
unsigned int xdp_xmit = 0; unsigned int xdp_xmit = 0;
struct veth_xdp_tx_bq bq;
int done; int done;
bq.count = 0;
xdp_set_return_frame_no_direct(); xdp_set_return_frame_no_direct();
done = veth_xdp_rcv(rq, budget, &xdp_xmit); done = veth_xdp_rcv(rq, budget, &xdp_xmit, &bq);
if (done < budget && napi_complete_done(napi, done)) { if (done < budget && napi_complete_done(napi, done)) {
/* Write rx_notify_masked before reading ptr_ring */ /* Write rx_notify_masked before reading ptr_ring */
...@@ -752,7 +788,7 @@ static int veth_poll(struct napi_struct *napi, int budget) ...@@ -752,7 +788,7 @@ static int veth_poll(struct napi_struct *napi, int budget)
} }
if (xdp_xmit & VETH_XDP_TX) if (xdp_xmit & VETH_XDP_TX)
veth_xdp_flush(rq->dev); veth_xdp_flush(rq->dev, &bq);
if (xdp_xmit & VETH_XDP_REDIR) if (xdp_xmit & VETH_XDP_REDIR)
xdp_do_flush_map(); xdp_do_flush_map();
xdp_clear_return_frame_no_direct(); xdp_clear_return_frame_no_direct();
......
...@@ -124,6 +124,14 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head, ...@@ -124,6 +124,14 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
loff_t *ppos, void **new_buf, loff_t *ppos, void **new_buf,
enum bpf_attach_type type); enum bpf_attach_type type);
int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int *level,
int *optname, char __user *optval,
int *optlen, char **kernel_optval);
int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
int optname, char __user *optval,
int __user *optlen, int max_optlen,
int retval);
static inline enum bpf_cgroup_storage_type cgroup_storage_type( static inline enum bpf_cgroup_storage_type cgroup_storage_type(
struct bpf_map *map) struct bpf_map *map)
{ {
...@@ -286,6 +294,38 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, ...@@ -286,6 +294,38 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
__ret; \ __ret; \
}) })
#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \
kernel_optval) \
({ \
int __ret = 0; \
if (cgroup_bpf_enabled) \
__ret = __cgroup_bpf_run_filter_setsockopt(sock, level, \
optname, optval, \
optlen, \
kernel_optval); \
__ret; \
})
#define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) \
({ \
int __ret = 0; \
if (cgroup_bpf_enabled) \
get_user(__ret, optlen); \
__ret; \
})
#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, optlen, \
max_optlen, retval) \
({ \
int __ret = retval; \
if (cgroup_bpf_enabled) \
__ret = __cgroup_bpf_run_filter_getsockopt(sock, level, \
optname, optval, \
optlen, max_optlen, \
retval); \
__ret; \
})
int cgroup_bpf_prog_attach(const union bpf_attr *attr, int cgroup_bpf_prog_attach(const union bpf_attr *attr,
enum bpf_prog_type ptype, struct bpf_prog *prog); enum bpf_prog_type ptype, struct bpf_prog *prog);
int cgroup_bpf_prog_detach(const union bpf_attr *attr, int cgroup_bpf_prog_detach(const union bpf_attr *attr,
...@@ -357,6 +397,11 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map, ...@@ -357,6 +397,11 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map,
#define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; }) #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
#define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; }) #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
#define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; }) #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; })
#define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) ({ 0; })
#define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, \
optlen, max_optlen, retval) ({ retval; })
#define BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock, level, optname, optval, optlen, \
kernel_optval) ({ 0; })
#define for_each_cgroup_storage_type(stype) for (; false; ) #define for_each_cgroup_storage_type(stype) for (; false; )
......
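The BPF_CGROUP_RUN_PROG_{SET,GET}SOCKOPT macros above are what the setsockopt()/getsockopt() syscall paths now invoke when cgroup BPF is enabled. On the program side, the new type sees the struct bpf_sockopt context that appears later in the UAPI hunk. As a non-authoritative sketch (helper header location and libbpf section naming are assumptions, constants spelled out to keep it self-contained):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"       /* assumption: selftests-style helper header */

  #define SOL_IP          0
  #define IP_TRANSPARENT  19

  SEC("cgroup/setsockopt")
  int deny_ip_transparent(struct bpf_sockopt *ctx)
  {
          /* Example policy: tasks in this cgroup may not set IP_TRANSPARENT.
           * Returning 0 rejects the call with -EPERM; 1 lets the kernel
           * handle the option as usual. */
          if (ctx->level == SOL_IP && ctx->optname == IP_TRANSPARENT)
                  return 0;
          return 1;
  }

  char _license[] SEC("license") = "GPL";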
...@@ -518,6 +518,7 @@ struct bpf_prog_array { ...@@ -518,6 +518,7 @@ struct bpf_prog_array {
struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags); struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags);
void bpf_prog_array_free(struct bpf_prog_array *progs); void bpf_prog_array_free(struct bpf_prog_array *progs);
int bpf_prog_array_length(struct bpf_prog_array *progs); int bpf_prog_array_length(struct bpf_prog_array *progs);
bool bpf_prog_array_is_empty(struct bpf_prog_array *array);
int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs, int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
__u32 __user *prog_ids, u32 cnt); __u32 __user *prog_ids, u32 cnt);
...@@ -1051,6 +1052,7 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto; ...@@ -1051,6 +1052,7 @@ extern const struct bpf_func_proto bpf_spin_unlock_proto;
extern const struct bpf_func_proto bpf_get_local_storage_proto; extern const struct bpf_func_proto bpf_get_local_storage_proto;
extern const struct bpf_func_proto bpf_strtol_proto; extern const struct bpf_func_proto bpf_strtol_proto;
extern const struct bpf_func_proto bpf_strtoul_proto; extern const struct bpf_func_proto bpf_strtoul_proto;
extern const struct bpf_func_proto bpf_tcp_sock_proto;
/* Shared helpers among cBPF and eBPF. */ /* Shared helpers among cBPF and eBPF. */
void bpf_user_rnd_init_once(void); void bpf_user_rnd_init_once(void);
......
...@@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable) ...@@ -30,6 +30,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
#ifdef CONFIG_CGROUP_BPF #ifdef CONFIG_CGROUP_BPF
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl) BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl)
BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt)
#endif #endif
#ifdef CONFIG_BPF_LIRC_MODE2 #ifdef CONFIG_BPF_LIRC_MODE2
BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2) BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
......
...@@ -578,8 +578,9 @@ struct bpf_skb_data_end { ...@@ -578,8 +578,9 @@ struct bpf_skb_data_end {
}; };
struct bpf_redirect_info { struct bpf_redirect_info {
u32 ifindex;
u32 flags; u32 flags;
u32 tgt_index;
void *tgt_value;
struct bpf_map *map; struct bpf_map *map;
struct bpf_map *map_to_flush; struct bpf_map *map_to_flush;
u32 kern_flags; u32 kern_flags;
...@@ -1199,4 +1200,14 @@ struct bpf_sysctl_kern { ...@@ -1199,4 +1200,14 @@ struct bpf_sysctl_kern {
u64 tmp_reg; u64 tmp_reg;
}; };
struct bpf_sockopt_kern {
struct sock *sk;
u8 *optval;
u8 *optval_end;
s32 level;
s32 optname;
s32 optlen;
s32 retval;
};
#endif /* __LINUX_FILTER_H__ */ #endif /* __LINUX_FILTER_H__ */
...@@ -106,6 +106,20 @@ static inline void __list_del(struct list_head * prev, struct list_head * next) ...@@ -106,6 +106,20 @@ static inline void __list_del(struct list_head * prev, struct list_head * next)
WRITE_ONCE(prev->next, next); WRITE_ONCE(prev->next, next);
} }
/*
* Delete a list entry and clear the 'prev' pointer.
*
* This is a special-purpose list clearing method used in the networking code
* for lists allocated as per-cpu, where we don't want to incur the extra
* WRITE_ONCE() overhead of a regular list_del_init(). The code that uses this
* needs to check the node 'prev' pointer instead of calling list_empty().
*/
static inline void __list_del_clearprev(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
entry->prev = NULL;
}
/** /**
* list_del - deletes entry from list. * list_del - deletes entry from list.
* @entry: the element to delete from the list. * @entry: the element to delete from the list.
......
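The comment spells out the contract; the pairing it implies is roughly the kernel-style sketch below (not a standalone program): the producer treats a NULL prev pointer as "not queued yet", and the flush side clears it again via __list_del_clearprev(). The cpumap and xskmap changes later in this diff follow exactly this shape.

  #include <linux/list.h>

  struct bulk_queue {
          struct list_head flush_node;
          /* ... pending frames ... */
  };

  static void bq_enqueue(struct bulk_queue *bq, struct list_head *flush_list)
  {
          /* prev == NULL means this queue is not on any flush list yet */
          if (!bq->flush_node.prev)
                  list_add(&bq->flush_node, flush_list);
  }

  static void bq_flush_all(struct list_head *flush_list)
  {
          struct bulk_queue *bq, *tmp;

          list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
                  /* drain bq to its destination, then unhook it */
                  __list_del_clearprev(&bq->flush_node);
          }
  }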
...@@ -2221,6 +2221,14 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk) ...@@ -2221,6 +2221,14 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1); return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
} }
static inline void tcp_bpf_rtt(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTT_CB_FLAG))
tcp_call_bpf(sk, BPF_SOCK_OPS_RTT_CB, 0, NULL);
}
#if IS_ENABLED(CONFIG_SMC) #if IS_ENABLED(CONFIG_SMC)
extern struct static_key_false tcp_have_smc; extern struct static_key_false tcp_have_smc;
#endif #endif
......
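tcp_bpf_rtt() only fires for sockets that opted in via the new BPF_SOCK_OPS_RTT_CB_FLAG, so sockets that never set the flag pay nothing. A hedged sketch of a sock_ops program that enables the flag per socket and then receives the per-RTT callback (helper header location assumed; the flag and op names come from the UAPI hunk further down):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"       /* assumption: selftests-style helper header */

  SEC("sockops")
  int rtt_tracker(struct bpf_sock_ops *skops)
  {
          switch (skops->op) {
          case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
          case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                  /* Opt this socket into the per-RTT callback. */
                  bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RTT_CB_FLAG);
                  break;
          case BPF_SOCK_OPS_RTT_CB:
                  /* Runs once per RTT; e.g. sample skops->srtt_us here. */
                  break;
          }
          return 1;
  }

  char _license[] SEC("license") = "GPL";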
...@@ -77,10 +77,11 @@ int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp); ...@@ -77,10 +77,11 @@ int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
void xsk_flush(struct xdp_sock *xs); void xsk_flush(struct xdp_sock *xs);
bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs); bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
/* Used from netdev driver */ /* Used from netdev driver */
bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr); u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
void xsk_umem_discard_addr(struct xdp_umem *umem); void xsk_umem_discard_addr(struct xdp_umem *umem);
void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries); void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len); bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc);
void xsk_umem_consume_tx_done(struct xdp_umem *umem); void xsk_umem_consume_tx_done(struct xdp_umem *umem);
struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries); struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem, struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
...@@ -99,6 +100,16 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr) ...@@ -99,6 +100,16 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
} }
/* Reuse-queue aware version of FILL queue helpers */ /* Reuse-queue aware version of FILL queue helpers */
static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem, u32 cnt)
{
struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
if (rq->length >= cnt)
return true;
return xsk_umem_has_addrs(umem, cnt - rq->length);
}
static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr) static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
{ {
struct xdp_umem_fq_reuse *rq = umem->fq_reuse; struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
...@@ -146,6 +157,11 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs) ...@@ -146,6 +157,11 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
return false; return false;
} }
static inline bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt)
{
return false;
}
static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr) static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
{ {
return NULL; return NULL;
...@@ -159,8 +175,8 @@ static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries) ...@@ -159,8 +175,8 @@ static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
{ {
} }
static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, static inline bool xsk_umem_consume_tx(struct xdp_umem *umem,
u32 *len) struct xdp_desc *desc)
{ {
return false; return false;
} }
...@@ -200,6 +216,11 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr) ...@@ -200,6 +216,11 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
return 0; return 0;
} }
static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem, u32 cnt)
{
return false;
}
static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr) static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
{ {
return NULL; return NULL;
......
...@@ -50,6 +50,35 @@ TRACE_EVENT(xdp_exception, ...@@ -50,6 +50,35 @@ TRACE_EVENT(xdp_exception,
__entry->ifindex) __entry->ifindex)
); );
TRACE_EVENT(xdp_bulk_tx,
TP_PROTO(const struct net_device *dev,
int sent, int drops, int err),
TP_ARGS(dev, sent, drops, err),
TP_STRUCT__entry(
__field(int, ifindex)
__field(u32, act)
__field(int, drops)
__field(int, sent)
__field(int, err)
),
TP_fast_assign(
__entry->ifindex = dev->ifindex;
__entry->act = XDP_TX;
__entry->drops = drops;
__entry->sent = sent;
__entry->err = err;
),
TP_printk("ifindex=%d action=%s sent=%d drops=%d err=%d",
__entry->ifindex,
__print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
__entry->sent, __entry->drops, __entry->err)
);
DECLARE_EVENT_CLASS(xdp_redirect_template, DECLARE_EVENT_CLASS(xdp_redirect_template,
TP_PROTO(const struct net_device *dev, TP_PROTO(const struct net_device *dev,
...@@ -146,9 +175,8 @@ struct _bpf_dtab_netdev { ...@@ -146,9 +175,8 @@ struct _bpf_dtab_netdev {
#endif /* __DEVMAP_OBJ_TYPE */ #endif /* __DEVMAP_OBJ_TYPE */
#define devmap_ifindex(fwd, map) \ #define devmap_ifindex(fwd, map) \
(!fwd ? 0 : \
((map->map_type == BPF_MAP_TYPE_DEVMAP) ? \ ((map->map_type == BPF_MAP_TYPE_DEVMAP) ? \
((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)) ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0)
#define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx) \ #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx) \
trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map), \ trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map), \
......
...@@ -170,6 +170,7 @@ enum bpf_prog_type { ...@@ -170,6 +170,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_FLOW_DISSECTOR, BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_CGROUP_SYSCTL, BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
BPF_PROG_TYPE_CGROUP_SOCKOPT,
}; };
enum bpf_attach_type { enum bpf_attach_type {
...@@ -194,6 +195,8 @@ enum bpf_attach_type { ...@@ -194,6 +195,8 @@ enum bpf_attach_type {
BPF_CGROUP_SYSCTL, BPF_CGROUP_SYSCTL,
BPF_CGROUP_UDP4_RECVMSG, BPF_CGROUP_UDP4_RECVMSG,
BPF_CGROUP_UDP6_RECVMSG, BPF_CGROUP_UDP6_RECVMSG,
BPF_CGROUP_GETSOCKOPT,
BPF_CGROUP_SETSOCKOPT,
__MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
}; };
...@@ -1568,8 +1571,11 @@ union bpf_attr { ...@@ -1568,8 +1571,11 @@ union bpf_attr {
* but this is only implemented for native XDP (with driver * but this is only implemented for native XDP (with driver
* support) as of this writing). * support) as of this writing).
* *
* All values for *flags* are reserved for future usage, and must * The lower two bits of *flags* are used as the return code if
* be left at zero. * the map lookup fails. This is so that the return value can be
* one of the XDP program return codes up to XDP_TX, as chosen by
* the caller. Any higher bits in the *flags* argument must be
* unset.
* *
* When used to redirect packets to net devices, this helper * When used to redirect packets to net devices, this helper
* provides a high performance increase over **bpf_redirect**\ (). * provides a high performance increase over **bpf_redirect**\ ().
...@@ -1764,6 +1770,7 @@ union bpf_attr { ...@@ -1764,6 +1770,7 @@ union bpf_attr {
* * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out) * * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out)
* * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission) * * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
* * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change) * * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
* * **BPF_SOCK_OPS_RTT_CB_FLAG** (every RTT)
* *
* Therefore, this function can be used to clear a callback flag by * Therefore, this function can be used to clear a callback flag by
* setting the appropriate bit to zero. e.g. to disable the RTO * setting the appropriate bit to zero. e.g. to disable the RTO
...@@ -3066,6 +3073,12 @@ struct bpf_tcp_sock { ...@@ -3066,6 +3073,12 @@ struct bpf_tcp_sock {
* sum(delta(snd_una)), or how many bytes * sum(delta(snd_una)), or how many bytes
* were acked. * were acked.
*/ */
__u32 dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups
* total number of DSACK blocks received
*/
__u32 delivered; /* Total data packets delivered incl. rexmits */
__u32 delivered_ce; /* Like the above but only ECE marked packets */
__u32 icsk_retransmits; /* Number of unrecovered [RTO] timeouts */
}; };
struct bpf_sock_tuple { struct bpf_sock_tuple {
...@@ -3308,7 +3321,8 @@ struct bpf_sock_ops { ...@@ -3308,7 +3321,8 @@ struct bpf_sock_ops {
#define BPF_SOCK_OPS_RTO_CB_FLAG (1<<0) #define BPF_SOCK_OPS_RTO_CB_FLAG (1<<0)
#define BPF_SOCK_OPS_RETRANS_CB_FLAG (1<<1) #define BPF_SOCK_OPS_RETRANS_CB_FLAG (1<<1)
#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2) #define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
#define BPF_SOCK_OPS_ALL_CB_FLAGS 0x7 /* Mask of all currently #define BPF_SOCK_OPS_RTT_CB_FLAG (1<<3)
#define BPF_SOCK_OPS_ALL_CB_FLAGS 0xF /* Mask of all currently
* supported cb flags * supported cb flags
*/ */
...@@ -3363,6 +3377,8 @@ enum { ...@@ -3363,6 +3377,8 @@ enum {
BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after
* socket transition to LISTEN state. * socket transition to LISTEN state.
*/ */
BPF_SOCK_OPS_RTT_CB, /* Called on every RTT.
*/
}; };
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
...@@ -3541,4 +3557,15 @@ struct bpf_sysctl { ...@@ -3541,4 +3557,15 @@ struct bpf_sysctl {
*/ */
}; };
struct bpf_sockopt {
__bpf_md_ptr(struct bpf_sock *, sk);
__bpf_md_ptr(void *, optval);
__bpf_md_ptr(void *, optval_end);
__s32 level;
__s32 optname;
__s32 optlen;
__s32 retval;
};
#endif /* _UAPI__LINUX_BPF_H__ */ #endif /* _UAPI__LINUX_BPF_H__ */
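The bpf_redirect_map() documentation change above is the user-visible half of the redirect rework: the low bits of the flags argument now act as the return code when the map lookup fails, so an XDP program can fall back to the stack instead of dropping. A sketch in the style of the AF_XDP sample (map definition macro and helper header are assumptions):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"       /* assumption: selftests-style helper header */

  struct bpf_map_def SEC("maps") xsks_map = {
          .type        = BPF_MAP_TYPE_XSKMAP,
          .key_size    = sizeof(int),
          .value_size  = sizeof(int),
          .max_entries = 64,
  };

  SEC("xdp")
  int xdp_sock_prog(struct xdp_md *ctx)
  {
          int index = ctx->rx_queue_index;

          /* Redirect to the AF_XDP socket bound to this RX queue; if no
           * socket is registered, the lookup fails and XDP_PASS (encoded
           * in the low bits of flags) is returned instead. */
          return bpf_redirect_map(&xsks_map, index, XDP_PASS);
  }

  char _license[] SEC("license") = "GPL";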
...@@ -46,6 +46,7 @@ struct xdp_mmap_offsets { ...@@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
#define XDP_UMEM_FILL_RING 5 #define XDP_UMEM_FILL_RING 5
#define XDP_UMEM_COMPLETION_RING 6 #define XDP_UMEM_COMPLETION_RING 6
#define XDP_STATISTICS 7 #define XDP_STATISTICS 7
#define XDP_OPTIONS 8
struct xdp_umem_reg { struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */ __u64 addr; /* Start of packet data area */
...@@ -60,6 +61,13 @@ struct xdp_statistics { ...@@ -60,6 +61,13 @@ struct xdp_statistics {
__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
}; };
struct xdp_options {
__u32 flags;
};
/* Flags for the flags field of struct xdp_options */
#define XDP_OPTIONS_ZEROCOPY (1 << 0)
/* Pgoff for mmaping the rings */ /* Pgoff for mmaping the rings */
#define XDP_PGOFF_RX_RING 0 #define XDP_PGOFF_RX_RING 0
#define XDP_PGOFF_TX_RING 0x80000000 #define XDP_PGOFF_TX_RING 0x80000000
......
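XDP_OPTIONS gives user space a way to ask, after bind(), whether an AF_XDP socket actually ended up in zero-copy mode. A minimal user-space sketch (SOL_XDP is defined defensively since older libc headers may not carry it):

  #include <string.h>
  #include <sys/socket.h>
  #include <linux/if_xdp.h>

  #ifndef SOL_XDP
  #define SOL_XDP 283
  #endif

  /* Returns 1 for zero-copy, 0 for copy mode, -1 on error. */
  static int xsk_is_zerocopy(int xsk_fd)
  {
          struct xdp_options opts;
          socklen_t optlen = sizeof(opts);

          memset(&opts, 0, sizeof(opts));
          if (getsockopt(xsk_fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen))
                  return -1;
          return !!(opts.flags & XDP_OPTIONS_ZEROCOPY);
  }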
...@@ -1809,6 +1809,15 @@ int bpf_prog_array_length(struct bpf_prog_array *array) ...@@ -1809,6 +1809,15 @@ int bpf_prog_array_length(struct bpf_prog_array *array)
return cnt; return cnt;
} }
bool bpf_prog_array_is_empty(struct bpf_prog_array *array)
{
struct bpf_prog_array_item *item;
for (item = array->items; item->prog; item++)
if (item->prog != &dummy_bpf_prog.prog)
return false;
return true;
}
static bool bpf_prog_array_copy_core(struct bpf_prog_array *array, static bool bpf_prog_array_copy_core(struct bpf_prog_array *array,
u32 *prog_ids, u32 *prog_ids,
...@@ -2101,3 +2110,4 @@ EXPORT_SYMBOL(bpf_stats_enabled_key); ...@@ -2101,3 +2110,4 @@ EXPORT_SYMBOL(bpf_stats_enabled_key);
#include <linux/bpf_trace.h> #include <linux/bpf_trace.h>
EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception); EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception);
EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx);
...@@ -32,14 +32,19 @@ ...@@ -32,14 +32,19 @@
/* General idea: XDP packets getting XDP redirected to another CPU, /* General idea: XDP packets getting XDP redirected to another CPU,
* will maximum be stored/queued for one driver ->poll() call. It is * will maximum be stored/queued for one driver ->poll() call. It is
* guaranteed that setting flush bit and flush operation happen on * guaranteed that queueing the frame and the flush operation happen on
* same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr() * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
* which queue in bpf_cpu_map_entry contains packets. * which queue in bpf_cpu_map_entry contains packets.
*/ */
#define CPU_MAP_BULK_SIZE 8 /* 8 == one cacheline on 64-bit archs */ #define CPU_MAP_BULK_SIZE 8 /* 8 == one cacheline on 64-bit archs */
struct bpf_cpu_map_entry;
struct bpf_cpu_map;
struct xdp_bulk_queue { struct xdp_bulk_queue {
void *q[CPU_MAP_BULK_SIZE]; void *q[CPU_MAP_BULK_SIZE];
struct list_head flush_node;
struct bpf_cpu_map_entry *obj;
unsigned int count; unsigned int count;
}; };
...@@ -52,6 +57,8 @@ struct bpf_cpu_map_entry { ...@@ -52,6 +57,8 @@ struct bpf_cpu_map_entry {
/* XDP can run multiple RX-ring queues, need __percpu enqueue store */ /* XDP can run multiple RX-ring queues, need __percpu enqueue store */
struct xdp_bulk_queue __percpu *bulkq; struct xdp_bulk_queue __percpu *bulkq;
struct bpf_cpu_map *cmap;
/* Queue with potential multi-producers, and single-consumer kthread */ /* Queue with potential multi-producers, and single-consumer kthread */
struct ptr_ring *queue; struct ptr_ring *queue;
struct task_struct *kthread; struct task_struct *kthread;
...@@ -65,23 +72,17 @@ struct bpf_cpu_map { ...@@ -65,23 +72,17 @@ struct bpf_cpu_map {
struct bpf_map map; struct bpf_map map;
/* Below members specific for map type */ /* Below members specific for map type */
struct bpf_cpu_map_entry **cpu_map; struct bpf_cpu_map_entry **cpu_map;
unsigned long __percpu *flush_needed; struct list_head __percpu *flush_list;
}; };
static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu, static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx);
struct xdp_bulk_queue *bq, bool in_napi_ctx);
static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
{
return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
}
static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
{ {
struct bpf_cpu_map *cmap; struct bpf_cpu_map *cmap;
int err = -ENOMEM; int err = -ENOMEM;
int ret, cpu;
u64 cost; u64 cost;
int ret;
if (!capable(CAP_SYS_ADMIN)) if (!capable(CAP_SYS_ADMIN))
return ERR_PTR(-EPERM); return ERR_PTR(-EPERM);
...@@ -105,7 +106,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) ...@@ -105,7 +106,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
/* make sure page count doesn't overflow */ /* make sure page count doesn't overflow */
cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *); cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
cost += cpu_map_bitmap_size(attr) * num_possible_cpus(); cost += sizeof(struct list_head) * num_possible_cpus();
/* Notice returns -EPERM on if map size is larger than memlock limit */ /* Notice returns -EPERM on if map size is larger than memlock limit */
ret = bpf_map_charge_init(&cmap->map.memory, cost); ret = bpf_map_charge_init(&cmap->map.memory, cost);
...@@ -114,12 +115,13 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) ...@@ -114,12 +115,13 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
goto free_cmap; goto free_cmap;
} }
/* A per cpu bitfield with a bit per possible CPU in map */ cmap->flush_list = alloc_percpu(struct list_head);
cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr), if (!cmap->flush_list)
__alignof__(unsigned long));
if (!cmap->flush_needed)
goto free_charge; goto free_charge;
for_each_possible_cpu(cpu)
INIT_LIST_HEAD(per_cpu_ptr(cmap->flush_list, cpu));
/* Alloc array for possible remote "destination" CPUs */ /* Alloc array for possible remote "destination" CPUs */
cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries * cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
sizeof(struct bpf_cpu_map_entry *), sizeof(struct bpf_cpu_map_entry *),
...@@ -129,7 +131,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) ...@@ -129,7 +131,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
return &cmap->map; return &cmap->map;
free_percpu: free_percpu:
free_percpu(cmap->flush_needed); free_percpu(cmap->flush_list);
free_charge: free_charge:
bpf_map_charge_finish(&cmap->map.memory); bpf_map_charge_finish(&cmap->map.memory);
free_cmap: free_cmap:
...@@ -334,7 +336,8 @@ static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, ...@@ -334,7 +336,8 @@ static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu,
{ {
gfp_t gfp = GFP_KERNEL | __GFP_NOWARN; gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
struct bpf_cpu_map_entry *rcpu; struct bpf_cpu_map_entry *rcpu;
int numa, err; struct xdp_bulk_queue *bq;
int numa, err, i;
/* Have map->numa_node, but choose node of redirect target CPU */ /* Have map->numa_node, but choose node of redirect target CPU */
numa = cpu_to_node(cpu); numa = cpu_to_node(cpu);
...@@ -349,6 +352,11 @@ static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, ...@@ -349,6 +352,11 @@ static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu,
if (!rcpu->bulkq) if (!rcpu->bulkq)
goto free_rcu; goto free_rcu;
for_each_possible_cpu(i) {
bq = per_cpu_ptr(rcpu->bulkq, i);
bq->obj = rcpu;
}
/* Alloc queue */ /* Alloc queue */
rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa); rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
if (!rcpu->queue) if (!rcpu->queue)
...@@ -405,7 +413,7 @@ static void __cpu_map_entry_free(struct rcu_head *rcu) ...@@ -405,7 +413,7 @@ static void __cpu_map_entry_free(struct rcu_head *rcu)
struct xdp_bulk_queue *bq = per_cpu_ptr(rcpu->bulkq, cpu); struct xdp_bulk_queue *bq = per_cpu_ptr(rcpu->bulkq, cpu);
/* No concurrent bq_enqueue can run at this point */ /* No concurrent bq_enqueue can run at this point */
bq_flush_to_queue(rcpu, bq, false); bq_flush_to_queue(bq, false);
} }
free_percpu(rcpu->bulkq); free_percpu(rcpu->bulkq);
/* Cannot kthread_stop() here, last put free rcpu resources */ /* Cannot kthread_stop() here, last put free rcpu resources */
...@@ -488,6 +496,7 @@ static int cpu_map_update_elem(struct bpf_map *map, void *key, void *value, ...@@ -488,6 +496,7 @@ static int cpu_map_update_elem(struct bpf_map *map, void *key, void *value,
rcpu = __cpu_map_entry_alloc(qsize, key_cpu, map->id); rcpu = __cpu_map_entry_alloc(qsize, key_cpu, map->id);
if (!rcpu) if (!rcpu)
return -ENOMEM; return -ENOMEM;
rcpu->cmap = cmap;
} }
rcu_read_lock(); rcu_read_lock();
__cpu_map_entry_replace(cmap, key_cpu, rcpu); __cpu_map_entry_replace(cmap, key_cpu, rcpu);
...@@ -514,14 +523,14 @@ static void cpu_map_free(struct bpf_map *map) ...@@ -514,14 +523,14 @@ static void cpu_map_free(struct bpf_map *map)
synchronize_rcu(); synchronize_rcu();
/* To ensure all pending flush operations have completed wait for flush /* To ensure all pending flush operations have completed wait for flush
* bitmap to indicate all flush_needed bits to be zero on _all_ cpus. * list be empty on _all_ cpus. Because the above synchronize_rcu()
* Because the above synchronize_rcu() ensures the map is disconnected * ensures the map is disconnected from the program we can assume no new
* from the program we can assume no new bits will be set. * items will be added to the list.
*/ */
for_each_online_cpu(cpu) { for_each_online_cpu(cpu) {
unsigned long *bitmap = per_cpu_ptr(cmap->flush_needed, cpu); struct list_head *flush_list = per_cpu_ptr(cmap->flush_list, cpu);
while (!bitmap_empty(bitmap, cmap->map.max_entries)) while (!list_empty(flush_list))
cond_resched(); cond_resched();
} }
...@@ -538,7 +547,7 @@ static void cpu_map_free(struct bpf_map *map) ...@@ -538,7 +547,7 @@ static void cpu_map_free(struct bpf_map *map)
/* bq flush and cleanup happen after RCU grace period */ /* bq flush and cleanup happen after RCU grace period */
__cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */ __cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */
} }
free_percpu(cmap->flush_needed); free_percpu(cmap->flush_list);
bpf_map_area_free(cmap->cpu_map); bpf_map_area_free(cmap->cpu_map);
kfree(cmap); kfree(cmap);
} }
...@@ -590,9 +599,9 @@ const struct bpf_map_ops cpu_map_ops = { ...@@ -590,9 +599,9 @@ const struct bpf_map_ops cpu_map_ops = {
.map_check_btf = map_check_no_btf, .map_check_btf = map_check_no_btf,
}; };
static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu, static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx)
struct xdp_bulk_queue *bq, bool in_napi_ctx)
{ {
struct bpf_cpu_map_entry *rcpu = bq->obj;
unsigned int processed = 0, drops = 0; unsigned int processed = 0, drops = 0;
const int to_cpu = rcpu->cpu; const int to_cpu = rcpu->cpu;
struct ptr_ring *q; struct ptr_ring *q;
...@@ -621,6 +630,8 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu, ...@@ -621,6 +630,8 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
bq->count = 0; bq->count = 0;
spin_unlock(&q->producer_lock); spin_unlock(&q->producer_lock);
__list_del_clearprev(&bq->flush_node);
/* Feedback loop via tracepoints */ /* Feedback loop via tracepoints */
trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu); trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu);
return 0; return 0;
...@@ -631,10 +642,11 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu, ...@@ -631,10 +642,11 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
*/ */
static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf) static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
{ {
struct list_head *flush_list = this_cpu_ptr(rcpu->cmap->flush_list);
struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq); struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
if (unlikely(bq->count == CPU_MAP_BULK_SIZE)) if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
bq_flush_to_queue(rcpu, bq, true); bq_flush_to_queue(bq, true);
/* Notice, xdp_buff/page MUST be queued here, long enough for /* Notice, xdp_buff/page MUST be queued here, long enough for
* driver to code invoking us to finished, due to driver * driver to code invoking us to finished, due to driver
@@ -646,6 +658,10 @@ static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 	 * operation, when completing napi->poll call.
 	 */
 	bq->q[bq->count++] = xdpf;
+
+	if (!bq->flush_node.prev)
+		list_add(&bq->flush_node, flush_list);
+
 	return 0;
 }
@@ -665,41 +681,16 @@ int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
 	return 0;
 }
 
-void __cpu_map_insert_ctx(struct bpf_map *map, u32 bit)
-{
-	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
-	unsigned long *bitmap = this_cpu_ptr(cmap->flush_needed);
-
-	__set_bit(bit, bitmap);
-}
-
 void __cpu_map_flush(struct bpf_map *map)
 {
 	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
-	unsigned long *bitmap = this_cpu_ptr(cmap->flush_needed);
-	u32 bit;
-
-	/* The napi->poll softirq makes sure __cpu_map_insert_ctx()
-	 * and __cpu_map_flush() happen on same CPU. Thus, the percpu
-	 * bitmap indicate which percpu bulkq have packets.
-	 */
-	for_each_set_bit(bit, bitmap, map->max_entries) {
-		struct bpf_cpu_map_entry *rcpu = READ_ONCE(cmap->cpu_map[bit]);
-		struct xdp_bulk_queue *bq;
-
-		/* This is possible if entry is removed by user space
-		 * between xdp redirect and flush op.
-		 */
-		if (unlikely(!rcpu))
-			continue;
-
-		__clear_bit(bit, bitmap);
-
-		/* Flush all frames in bulkq to real queue */
-		bq = this_cpu_ptr(rcpu->bulkq);
-		bq_flush_to_queue(rcpu, bq, true);
+	struct list_head *flush_list = this_cpu_ptr(cmap->flush_list);
+	struct xdp_bulk_queue *bq, *tmp;
 
+	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+		bq_flush_to_queue(bq, true);
 
 		/* If already running, costs spin_lock_irqsave + smb_mb */
-		wake_up_process(rcpu->kthread);
+		wake_up_process(bq->obj->kthread);
 	}
 }
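The cpumap hunks above replace the per-map flush bitmap with a per-cpu list of bulk queues: bq_enqueue() links a queue onto the flush list only the first time it receives a frame, and bq_flush_to_queue() unlinks it once drained, so __cpu_map_flush() walks only queues that actually hold pending work. Below is a minimal standalone C sketch of that pattern; all names (bulk_queue, flush_all, list_add_node, ...) are illustrative and this is not the kernel implementation.

  /* Standalone sketch (not kernel code) of the per-CPU flush-list pattern:
   * a bulk queue is linked onto the flush list the first time it receives
   * a frame (prev == NULL means "not linked") and unlinked when drained,
   * so flushing only visits queues that actually hold pending work. */
  #include <stddef.h>
  #include <stdio.h>

  struct list_node { struct list_node *prev, *next; };

  struct bulk_queue {
      struct list_node flush_node;  /* prev == NULL <=> not on flush list */
      int count;                    /* pending frames */
      int id;
  };

  static void list_init(struct list_node *head)
  {
      head->prev = head->next = head;
  }

  static void list_add_node(struct list_node *n, struct list_node *head)
  {
      n->next = head->next;
      n->prev = head;
      head->next->prev = n;
      head->next = n;
  }

  static void list_del_clearprev(struct list_node *n)
  {
      n->next->prev = n->prev;
      n->prev->next = n->next;
      n->prev = NULL;               /* re-arms the "!prev" linked test */
  }

  static void enqueue(struct bulk_queue *bq, struct list_node *flush_list)
  {
      bq->count++;
      if (!bq->flush_node.prev)     /* link at most once until flushed */
          list_add_node(&bq->flush_node, flush_list);
  }

  static void flush_all(struct list_node *flush_list)
  {
      while (flush_list->next != flush_list) {
          struct bulk_queue *bq = (struct bulk_queue *)((char *)flush_list->next -
                                  offsetof(struct bulk_queue, flush_node));

          printf("flushing queue %d, %d frame(s)\n", bq->id, bq->count);
          bq->count = 0;
          list_del_clearprev(&bq->flush_node);
      }
  }

  int main(void)
  {
      struct list_node flush_list;
      struct bulk_queue q0 = { .id = 0 }, q1 = { .id = 1 };

      list_init(&flush_list);
      enqueue(&q0, &flush_list);
      enqueue(&q0, &flush_list);    /* already linked, not added twice */
      enqueue(&q1, &flush_list);
      flush_all(&flush_list);       /* visits q1 then q0, nothing else */
      return 0;
  }

The kernel version additionally takes the ptr_ring producer lock while draining and wakes the target CPU's kthread from __cpu_map_flush(); the sketch only models the list bookkeeping that replaced the bitmap.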
@@ -145,8 +145,7 @@ void __xsk_map_flush(struct bpf_map *map)
 
 	list_for_each_entry_safe(xs, tmp, flush_list, flush_node) {
 		xsk_flush(xs);
-		__list_del(xs->flush_node.prev, xs->flush_node.next);
-		xs->flush_node.prev = NULL;
+		__list_del_clearprev(&xs->flush_node);
 	}
 }
@@ -85,7 +85,7 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
 	kfree(xa);
 }
 
-bool __mem_id_disconnect(int id, bool force)
+static bool __mem_id_disconnect(int id, bool force)
 {
 	struct xdp_mem_allocator *xa;
 	bool safe_to_remove = true;
@@ -778,6 +778,8 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
 			tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
 			tp->rtt_seq = tp->snd_nxt;
 			tp->mdev_max_us = tcp_rto_min_us(sk);
+
+			tcp_bpf_rtt(sk);
 		}
 	} else {
 		/* no previous measure. */
@@ -786,6 +788,8 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
 		tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
 		tp->mdev_max_us = tp->rttvar_us;
 		tp->rtt_seq = tp->snd_nxt;
+
+		tcp_bpf_rtt(sk);
 	}
 	tp->srtt_us = max(1U, srtt);
 }
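The two tcp_bpf_rtt() calls above fire the new per-RTT sock_ops callback for sockets that have opted in. Below is a sketch of a sockops BPF program that uses it, assuming a recent libbpf-style build (BTF map definitions, <bpf/bpf_helpers.h>); the map name rtt_events and the counting logic are made up for illustration, while BPF_SOCK_OPS_RTT_CB, BPF_SOCK_OPS_RTT_CB_FLAG and bpf_sock_ops_cb_flags_set() come from this series.

  /* Sketch of a sockops program that opts into the per-RTT callback.
   * Map name and counting logic are illustrative only. */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
      __uint(type, BPF_MAP_TYPE_ARRAY);
      __uint(max_entries, 1);
      __type(key, __u32);
      __type(value, __u64);
  } rtt_events SEC(".maps");

  SEC("sockops")
  int count_rtts(struct bpf_sock_ops *skops)
  {
      __u32 key = 0;
      __u64 *cnt;

      switch (skops->op) {
      case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
      case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
          /* Opt this socket into RTT callbacks; they are off by default. */
          bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RTT_CB_FLAG);
          break;
      case BPF_SOCK_OPS_RTT_CB:
          /* One invocation per RTT sample taken in tcp_rtt_estimator(). */
          cnt = bpf_map_lookup_elem(&rtt_events, &key);
          if (cnt)
              __sync_fetch_and_add(cnt, 1);
          break;
      }
      return 1;
  }

  char _license[] SEC("license") = "GPL";

Because the flag is set per socket, connections that never enable BPF_SOCK_OPS_RTT_CB_FLAG pay no extra cost in the RTT path.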
@@ -25,4 +25,4 @@ attached to the cgroupv2).
 
 To remove (unattach) a socket_ops BPF program from a cgroupv2:
 
-bpftool cgroup attach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog
+bpftool cgroup detach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog
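As a usage note, the programs currently attached to the cgroup can be listed first to confirm what is about to be detached, e.g.:

  bpftool cgroup show /tmp/cgroupv2/foo

(The cgroup path /tmp/cgroupv2/foo is the example path already used in this documentation, not a required location.)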
@@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
 	[BPF_PROG_TYPE_SK_REUSEPORT]		= "sk_reuseport",
 	[BPF_PROG_TYPE_FLOW_DISSECTOR]		= "flow_dissector",
 	[BPF_PROG_TYPE_CGROUP_SYSCTL]		= "cgroup_sysctl",
+	[BPF_PROG_TYPE_CGROUP_SOCKOPT]		= "cgroup_sockopt",
 };
 
 extern const char * const map_type_name[];
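The new entry gives BPF_PROG_TYPE_CGROUP_SOCKOPT a human-readable name in bpftool's listings. The sketch below only illustrates how such an enum-indexed name table is typically consumed, with a bounds and NULL fallback; prog_type_str() and the PT_* constants are made up for this example and are not bpftool code.

  /* Illustrative bounded lookup in an enum-indexed name table. */
  #include <stdio.h>

  enum prog_type {
      PT_SK_REUSEPORT,
      PT_FLOW_DISSECTOR,
      PT_CGROUP_SYSCTL,
      PT_CGROUP_SOCKOPT,
      PT_MAX,
  };

  static const char * const prog_type_name[] = {
      [PT_SK_REUSEPORT]   = "sk_reuseport",
      [PT_FLOW_DISSECTOR] = "flow_dissector",
      [PT_CGROUP_SYSCTL]  = "cgroup_sysctl",
      [PT_CGROUP_SOCKOPT] = "cgroup_sockopt",
  };

  static const char *prog_type_str(unsigned int type)
  {
      /* Fall back to a placeholder for unknown or newer program types. */
      if (type >= PT_MAX || !prog_type_name[type])
          return "unknown";
      return prog_type_name[type];
  }

  int main(void)
  {
      printf("%s\n", prog_type_str(PT_CGROUP_SOCKOPT)); /* -> cgroup_sockopt */
      return 0;
  }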