Commit 5ff5622e authored by David S. Miller

Merge branch 'NVMeTCP-Offload-ULP'

Shai Malin says:

====================
NVMeTCP Offload ULP

With the goal of enabling a generic infrastructure that allows NVMe/TCP
offload devices like NICs to seamlessly plug into the NVMe-oF stack, this
patch series introduces the nvme-tcp-offload ULP host layer, which will
be a new transport type called "tcp-offload" and will serve as an
abstraction layer to work with vendor specific nvme-tcp offload drivers.

NVMeTCP offload is a full offload of the NVMeTCP protocol; this includes
both the TCP level and the NVMeTCP level.

The nvme-tcp-offload transport can co-exist with the existing tcp and
other transports. The tcp offload was designed so that stack changes are
kept to a bare minimum: only registering new transports.
All other APIs, ops etc. are identical to the regular tcp transport.
Representing the TCP offload as a new transport allows clear and manageable
differentiation between the connections which should use the offload path
and those that are not offloaded (even on the same device).

The nvme-tcp-offload layers and API compared to nvme-tcp and nvme-rdma:

* NVMe layer: *

       [ nvme/nvme-fabrics/blk-mq ]
             |
        (nvme API and blk-mq API)
             |
             |
* Vendor agnostic transport layer: *

      [ nvme-rdma ] [ nvme-tcp ] [ nvme-tcp-offload ]
             |        |             |
           (Verbs)
             |        |             |
             |     (Socket)
             |        |             |
             |        |        (nvme-tcp-offload API)
             |        |             |
             |        |             |
* Vendor Specific Driver: *

             |        |             |
           [ qedr ]
                      |             |
                   [ qede ]
                                    |
                                  [ qedn ]

Performance:
============
With this implementation on top of the Marvell qedn driver (using the
Marvell FastLinQ NIC), we were able to demonstrate the following CPU
utilization improvement:

On AMD EPYC 7402, 2.80GHz, 28 cores:
- For 16K queued read IOs, 16jobs, 4qd (50Gbps line rate):
  Improved the CPU utilization from 15.1% with NVMeTCP SW to 4.7% with
  NVMeTCP offload.

On Intel(R) Xeon(R) Gold 5122 CPU, 3.60GHz, 16 cores:
- For 512K queued read IOs, 16jobs, 4qd (25Gbps line rate):
  Improved the CPU utilization from 16.3% with NVMeTCP SW to 1.1% with
  NVMeTCP offload.

In addition, we were able to demonstrate the following latency improvement:
- For 200K read IOPS (16 jobs, 16 qd, with fio rate limiter):
  Improved the average latency from 105 usec with NVMeTCP SW to 39 usec
  with NVMeTCP offload.

  Improved the 99.99 tail latency from 570 usec with NVMeTCP SW to 91 usec
  with NVMeTCP offload.

The end-to-end offload latency was measured from fio while running against
a null-device back end.

Upstream plan:
==============
The RFC series "NVMeTCP Offload ULP and QEDN Device Driver"
https://lore.kernel.org/netdev/20210531225222.16992-1-smalin@marvell.com/
was designed in a modular way so that part 1 (nvme-tcp-offload) and
part 2 (qed) are independent and part 3 (qedn) depends on both parts 1+2.

- Part 1 (RFC patches 1-8): NVMeTCP Offload ULP
  The nvme-tcp-offload patches will be sent to
  'linux-nvme@lists.infradead.org'.

- Part 2 (RFC patches 9-15): QED NVMeTCP Offload
  The qed infrastructure will be sent to 'netdev@vger.kernel.org'.

Once parts 1 and 2 are accepted:

- Part 3 (RFC patches 16-27): QEDN NVMeTCP Offload
  The qedn patches will be sent to 'linux-nvme@lists.infradead.org'.

Marvell is fully committed to maintaining and testing the new
nvme-tcp-offload layer and to addressing issues with it.

Usage:
======
With the Marvell NVMeTCP offload design, the network-device (qede) and the
offload-device (qedn) are paired on each port, logically similar to the
RDMA model.
The user will interact with the network-device in order to configure
the ip/vlan. The NVMeTCP configuration is populated as part of the
nvme connect command.

Example:
Assign IP to the net-device (from any existing Linux tool):

    ip addr add 100.100.0.101/24 dev p1p1

This IP will be used by both net-device (qede) and offload-device (qedn).

In order to connect from "sw" nvme-tcp through the net-device (qede):

    nvme connect -t tcp -s 4420 -a 100.100.0.100 -n testnqn

In order to connect from "offload" nvme-tcp through the offload-device (qedn):

    nvme connect -t tcp_offload -s 4420 -a 100.100.0.100 -n testnqn

An alternative approach, as a future enhancement that will not impact this
series, would be to modify nvme-cli with a new flag that determines whether
"-t tcp" selects the regular nvme-tcp (the default) or nvme-tcp-offload.

Example:

    nvme connect -t tcp -s 4420 -a 100.100.0.100 -n testnqn -[new flag]

Queue Initialization Design:
============================
The nvme-tcp-offload ULP module shall register with the existing
nvmf_transport_ops (.name = "tcp_offload"), nvme_ctrl_ops and blk_mq_ops.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following ops:
- claim_dev() - in order to resolve the route to the target according to
                the paired net_dev.
- create_queue() - in order to create offloaded nvme-tcp queue.

The nvme-tcp-offload ULP module shall manage all the controller level
functionalities, call claim_dev and based on the return values shall call
the relevant module create_queue in order to create the admin queue and
the IO queues.
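
The queue initialization flow described above can be sketched in user-space
C. This is only an illustration under stated assumptions: every mock_* and
demo_* name below is hypothetical and is not part of the kernel API, the
"reachable prefix" string match stands in for the real route resolution, and
the convention that claim_dev returns nonzero for a reachable target while
create_queue returns 0 on success is assumed for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the offload device and its ops. */
struct mock_ofld_dev;

struct mock_ofld_ops {
	/* Assumed convention: nonzero when the target is reachable via dev. */
	int (*claim_dev)(struct mock_ofld_dev *dev, const char *traddr);
	/* Assumed convention: 0 on success, negative on failure. */
	int (*create_queue)(struct mock_ofld_dev *dev, int qid,
			    size_t queue_size);
};

struct mock_ofld_dev {
	const char *name;
	const struct mock_ofld_ops *ops;
	const char *reachable_prefix;	/* toy replacement for a route lookup */
	int queues_created;
};

/* Toy vendor ops: claim a target when its address starts with the prefix. */
static int demo_claim_dev(struct mock_ofld_dev *dev, const char *traddr)
{
	size_t len = strlen(dev->reachable_prefix);

	return strncmp(traddr, dev->reachable_prefix, len) == 0;
}

static int demo_create_queue(struct mock_ofld_dev *dev, int qid,
			     size_t queue_size)
{
	(void)qid;
	if (queue_size == 0)
		return -1;
	dev->queues_created++;
	return 0;
}

static const struct mock_ofld_ops demo_ops = {
	.claim_dev = demo_claim_dev,
	.create_queue = demo_create_queue,
};

/* ULP side: return the first registered device that claims the target. */
static struct mock_ofld_dev *mock_lookup_dev(struct mock_ofld_dev **devs,
					     size_t n, const char *traddr)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i]->ops->claim_dev(devs[i], traddr))
			return devs[i];
	return NULL;
}

/* ULP side: create the admin queue (qid 0), then nr_io IO queues. */
static int mock_setup_queues(struct mock_ofld_dev *dev, int nr_io,
			     size_t queue_size)
{
	assert(dev != NULL);
	for (int qid = 0; qid <= nr_io; qid++) {
		int ret = dev->ops->create_queue(dev, qid, queue_size);

		if (ret)
			return ret;
	}
	return 0;
}
```

In this sketch the ULP owns the iteration over devices and queue IDs, while
the vendor ops only answer "can I reach this target?" and "create this one
queue", which mirrors the split of responsibilities described above.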

IO-path Design:
===============
The nvme-tcp-offload shall work at the IO-level - the nvme-tcp-offload
ULP module shall pass the request (the IO) to the nvme-tcp-offload vendor
driver and later, the nvme-tcp-offload vendor driver returns the request
completion (the IO completion).
No additional handling is needed in between; this design reduces the
CPU utilization, as shown in the Performance section above.

The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following IO-path ops:
- send_req() - in order to pass the request to the handling of the
               offload driver that shall pass it to the vendor specific device.
- poll_queue() - in order to poll a given queue for completions.

Once the IO completes, the nvme-tcp-offload vendor driver shall call
command.done() that will invoke the nvme-tcp-offload ULP layer to
complete the request.
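
The IO-path handoff above can be sketched in user-space C as follows. This is
a simplified illustration, not the kernel implementation: all mock_* names
are hypothetical, the "hardware" completes every in-flight request with
status 0 as soon as it is polled, and the fixed queue depth is an arbitrary
value chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_QUEUE_DEPTH 8	/* arbitrary depth for the sketch */

/* Hypothetical per-IO request; done() is the ULP completion callback. */
struct mock_req {
	uint16_t status;
	int completed;
	void (*done)(struct mock_req *req, uint16_t status);
};

/* Hypothetical vendor queue holding requests handed over by the ULP. */
struct mock_io_queue {
	struct mock_req *inflight[MOCK_QUEUE_DEPTH];
	size_t count;
};

/* ULP side: record the completion reported by the vendor driver. */
static void mock_ulp_done(struct mock_req *req, uint16_t status)
{
	req->status = status;
	req->completed = 1;
}

/* Vendor side: accept a request, or return -1 when the queue is full. */
static int mock_send_req(struct mock_io_queue *q, struct mock_req *req)
{
	assert(req->done != NULL);
	if (q->count == MOCK_QUEUE_DEPTH)
		return -1;
	q->inflight[q->count++] = req;
	return 0;
}

/* Vendor side: complete everything in flight, return the number completed. */
static int mock_poll_queue(struct mock_io_queue *q)
{
	int n = 0;

	while (q->count > 0) {
		struct mock_req *req = q->inflight[--q->count];

		req->done(req, 0);
		n++;
	}
	return n;
}

/* ULP side: attach the completion callback and pass the IO down. */
static int mock_ulp_queue_rq(struct mock_io_queue *q, struct mock_req *req)
{
	req->completed = 0;
	req->done = mock_ulp_done;
	return mock_send_req(q, req);
}
```

The ULP touches each request only twice in this sketch, once on submission
and once in the done() callback; everything in between belongs to the vendor
side.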

TCP events:
===========
The Marvell FastLinQ NIC HW engine handles all the TCP re-transmissions
and out-of-order (OOO) events.

Teardown and errors:
====================
In case of an NVMeTCP queue error, the nvme-tcp-offload vendor driver shall
call nvme_tcp_ofld_report_queue_err().
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following teardown ops:
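- drain_queue() - in order to block until no additional completions will
                  arrive on the queue and the HW will not access host memory.
- destroy_queue() - in order to close the TCP + NVMeTCP connection of the
                    queue.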
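
One possible ordering of the teardown ops is sketched below in user-space C:
drain the queue first so no completions are outstanding, then destroy it.
This is an assumption-laden illustration: the mock_* names and flag values
are hypothetical, and the idea of guarding each step with an ALLOCATED/LIVE
flag is modeled loosely on the queue flags in tcp-offload.h rather than
copied from the ULP implementation.

```c
#include <assert.h>

/* Hypothetical flags, loosely modeled on NVME_TCP_OFLD_Q_ALLOCATED/LIVE. */
enum mock_td_flags {
	MOCK_Q_ALLOCATED = 1 << 0,
	MOCK_Q_LIVE = 1 << 1,
};

struct mock_td_queue {
	unsigned int flags;
	int drained;	/* number of drain_queue calls observed */
	int destroyed;	/* number of destroy_queue calls observed */
};

/* Vendor side: after this returns, no more completions are expected. */
static void mock_drain_queue(struct mock_td_queue *q)
{
	q->drained++;
}

/* Vendor side: close the connection; the queue must already be stopped. */
static void mock_destroy_queue(struct mock_td_queue *q)
{
	assert(!(q->flags & MOCK_Q_LIVE));
	q->destroyed++;
}

/* ULP side: stop a live queue exactly once. */
static void mock_stop_queue(struct mock_td_queue *q)
{
	if (!(q->flags & MOCK_Q_LIVE))
		return;
	q->flags &= ~MOCK_Q_LIVE;
	mock_drain_queue(q);
}

/* ULP side: stop (if needed) and then free an allocated queue once. */
static void mock_free_queue(struct mock_td_queue *q)
{
	if (!(q->flags & MOCK_Q_ALLOCATED))
		return;
	mock_stop_queue(q);
	q->flags &= ~MOCK_Q_ALLOCATED;
	mock_destroy_queue(q);
}
```

Guarding both steps with flags makes the teardown idempotent in this sketch,
so an error path and a regular disconnect can both call mock_free_queue()
without double-destroying the queue.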

The Marvell FastLinQ NIC HW engine:
====================================
The Marvell NIC HW engine is capable of offloading the entire TCP/IP
stack and managing up to 64K connections per PF. Use cases that are already
implemented and upstream include iWARP (by the Marvell qedr driver)
and iSCSI (by the Marvell qedi driver).
In addition, the Marvell NIC HW engine offloads the NVMeTCP queue layer
and is able to manage the IO level also in case of TCP re-transmissions
and OOO events.
The HW engine enables direct data placement (including the data digest CRC
calculation and validation) and direct data transmission (including data
digest CRC calculation).

The Marvell qedn driver:
========================
The new driver will be added under "drivers/nvme/hw" and will be enabled
by the Kconfig "Marvell NVM Express over Fabrics TCP offload".
As part of the qedn init, the driver will register as a PCI device driver
and will work with the Marvell FastLinQ NIC.
As part of the probe, the driver will register to the nvme_tcp_offload
(ULP) and to the qed module (qed_nvmetcp_ops) - similar to other
"qed_*_ops" which are used by the qede, qedr, qedf and qedi device
drivers.

nvme-tcp-offload Future work:
=============================
- NVMF_OPT_HOST_IFACE Support.

Changes since RFC v1:
=====================
- nvme-tcp-offload: Fix nvme_tcp_ofld_ops return values.
- nvme-tcp-offload: Remove NVMF_TRTYPE_TCP_OFFLOAD.
- nvme-tcp-offload: Add nvme_tcp_ofld_poll() implementation.
- nvme-tcp-offload: Fix nvme_tcp_ofld_queue_rq() to check map_sg() and
  send_req() return values.

Changes since RFC v2:
=====================
- nvme-tcp-offload: Fixes in controller and queue level (patches 3-6).
- qedn: Add Marvell's NVMeTCP HW offload vendor driver init and probe
  (patches 8-11).

Changes since RFC v3:
=====================
- nvme-tcp-offload: Add the full implementation of the nvme-tcp-offload layer
  including the new ops: setup_ctrl(), release_ctrl(), commit_rqs() and new
  flows (ASYNC and timeout).
- nvme-tcp-offload: Add device maximums: max_hw_sectors, max_segments.
- nvme-tcp-offload: layer design and optimization changes.

Changes since RFC v4:
=====================
(Many thanks to Hannes Reinecke for his feedback)
- nvme_tcp_offload: Add num_hw_vectors in order to limit the number of queues.
- nvme_tcp_offload: Add per device private_data.
- nvme_tcp_offload: Fix header digest, data digest and tos initialization.

Changes since RFC v5:
=====================
(Many thanks to Sagi Grimberg for his feedback)
- nvme-fabrics: Expose nvmf_check_required_opts() globally (as a new patch).
- nvme_tcp_offload: Remove io-queues BLK_MQ_F_BLOCKING.
- nvme_tcp_offload: Fix the nvme_tcp_ofld_stop_queue (drain_queue) flow.
- nvme_tcp_offload: Fix the nvme_tcp_ofld_free_queue (destroy_queue) flow.
- nvme_tcp_offload: Change rwsem to mutex.
- nvme_tcp_offload: remove redundant fields.
- nvme_tcp_offload: Remove the "new" from setup_ctrl().
- nvme_tcp_offload: Remove the init_req() and commit_rqs() ops.
- nvme_tcp_offload: Minor fixes in nvme_tcp_ofld_create_ctrl() and
  nvme_tcp_ofld_free_queue().
- nvme_tcp_offload: Patch 8 (timeout and async) was squashed into
  patch 7 (io level).

Changes since RFC v6:
=====================
- No changes in nvme_tcp_offload (only in qedn).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parents ae1d9cc3 35155e26
@@ -13107,6 +13107,14 @@ F: drivers/nvme/host/
 F: include/linux/nvme.h
 F: include/uapi/linux/nvme_ioctl.h
+NVM EXPRESS TCP OFFLOAD TRANSPORT DRIVERS
+M: Shai Malin <smalin@marvell.com>
+M: Ariel Elior <aelior@marvell.com>
+L: linux-nvme@lists.infradead.org
+S: Supported
+F: drivers/nvme/host/tcp-offload.c
+F: drivers/nvme/host/tcp-offload.h
 NVM EXPRESS FC TRANSPORT DRIVERS
 M: James Smart <james.smart@broadcom.com>
 L: linux-nvme@lists.infradead.org
@@ -84,3 +84,20 @@ config NVME_TCP
 from https://github.com/linux-nvme/nvme-cli.
 If unsure, say N.
+config NVME_TCP_OFFLOAD
+tristate "NVM Express over Fabrics TCP offload common layer"
+default m
+depends on BLOCK
+depends on INET
+select NVME_CORE
+select NVME_FABRICS
+help
+This provides support for the NVMe over Fabrics protocol using
+the TCP offload transport. This allows you to use remote block devices
+exported using the NVMe protocol set.
+To configure a NVMe over Fabrics controller use the nvme-cli tool
+from https://github.com/linux-nvme/nvme-cli.
+If unsure, say N.
@@ -8,6 +8,7 @@ obj-$(CONFIG_NVME_FABRICS) += nvme-fabrics.o
 obj-$(CONFIG_NVME_RDMA) += nvme-rdma.o
 obj-$(CONFIG_NVME_FC) += nvme-fc.o
 obj-$(CONFIG_NVME_TCP) += nvme-tcp.o
+obj-$(CONFIG_NVME_TCP_OFFLOAD) += nvme-tcp-offload.o
 nvme-core-y := core.o ioctl.o
 nvme-core-$(CONFIG_TRACING) += trace.o
@@ -26,3 +27,5 @@ nvme-rdma-y += rdma.o
 nvme-fc-y += fc.o
 nvme-tcp-y += tcp.o
+nvme-tcp-offload-y += tcp-offload.o
@@ -860,8 +860,8 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 return ret;
 }
-static int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
-unsigned int required_opts)
+int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
+unsigned int required_opts)
 {
 if ((opts->mask & required_opts) != required_opts) {
 int i;
@@ -879,6 +879,7 @@ static int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
 return 0;
 }
+EXPORT_SYMBOL_GPL(nvmf_check_required_opts);
 bool nvmf_ip_options_match(struct nvme_ctrl *ctrl,
 struct nvmf_ctrl_options *opts)
@@ -942,13 +943,6 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
 }
 EXPORT_SYMBOL_GPL(nvmf_free_options);
-#define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
-#define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
-NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
-NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
-NVMF_OPT_DISABLE_SQFLOW |\
-NVMF_OPT_FAIL_FAST_TMO)
 static struct nvme_ctrl *
 nvmf_create_ctrl(struct device *dev, const char *buf)
 {
@@ -68,6 +68,13 @@ enum {
 NVMF_OPT_FAIL_FAST_TMO = 1 << 20,
 };
+#define NVMF_REQUIRED_OPTS (NVMF_OPT_TRANSPORT | NVMF_OPT_NQN)
+#define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
+NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
+NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
+NVMF_OPT_DISABLE_SQFLOW |\
+NVMF_OPT_FAIL_FAST_TMO)
 /**
 * struct nvmf_ctrl_options - Used to hold the options specified
 * with the parsing opts enum.
@@ -186,5 +193,7 @@ int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size);
 bool nvmf_should_reconnect(struct nvme_ctrl *ctrl);
 bool nvmf_ip_options_match(struct nvme_ctrl *ctrl,
 struct nvmf_ctrl_options *opts);
+int nvmf_check_required_opts(struct nvmf_ctrl_options *opts,
+unsigned int required_opts);
 #endif /* _NVME_FABRICS_H */
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0 */
/*
* Copyright 2021 Marvell. All rights reserved.
*/
/* Linux includes */
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/types.h>
#include <linux/nvme-tcp.h>
/* Driver includes */
#include "nvme.h"
#include "fabrics.h"
/* Forward declarations */
struct nvme_tcp_ofld_ops;
/* Representation of a vendor-specific device. This is the struct used to
* register to the offload layer by the vendor-specific driver during its probe
* function.
* Allocated by vendor-specific driver.
*/
struct nvme_tcp_ofld_dev {
struct list_head entry;
struct net_device *ndev;
struct nvme_tcp_ofld_ops *ops;
/* Vendor specific driver context */
int num_hw_vectors;
};
/* Per IO struct holding the nvme_request and command
* Allocated by blk-mq.
*/
struct nvme_tcp_ofld_req {
struct nvme_request req;
struct nvme_command nvme_cmd;
struct list_head queue_entry;
struct nvme_tcp_ofld_queue *queue;
/* Vendor specific driver context */
void *private_data;
/* async flag is used to distinguish between async and IO flow
* in common send_req() of nvme_tcp_ofld_ops.
*/
bool async;
void (*done)(struct nvme_tcp_ofld_req *req,
union nvme_result *result,
__le16 status);
};
enum nvme_tcp_ofld_queue_flags {
NVME_TCP_OFLD_Q_ALLOCATED = 0,
NVME_TCP_OFLD_Q_LIVE = 1,
};
/* Allocated by nvme_tcp_ofld */
struct nvme_tcp_ofld_queue {
/* Offload device associated to this queue */
struct nvme_tcp_ofld_dev *dev;
struct nvme_tcp_ofld_ctrl *ctrl;
unsigned long flags;
size_t cmnd_capsule_len;
/* mutex used during stop_queue */
struct mutex queue_lock;
u8 hdr_digest;
u8 data_digest;
u8 tos;
/* Vendor specific driver context */
void *private_data;
/* Error callback function */
int (*report_err)(struct nvme_tcp_ofld_queue *queue);
};
/* Connectivity (routing) params used for establishing a connection */
struct nvme_tcp_ofld_ctrl_con_params {
struct sockaddr_storage remote_ip_addr;
/* If NVMF_OPT_HOST_TRADDR is provided it will be set in local_ip_addr
* in nvme_tcp_ofld_create_ctrl().
* If NVMF_OPT_HOST_TRADDR is not provided the local_ip_addr will be
* initialized by claim_dev().
*/
struct sockaddr_storage local_ip_addr;
};
/* Allocated by nvme_tcp_ofld */
struct nvme_tcp_ofld_ctrl {
struct nvme_ctrl nctrl;
struct list_head list;
struct nvme_tcp_ofld_dev *dev;
/* admin and IO queues */
struct blk_mq_tag_set tag_set;
struct blk_mq_tag_set admin_tag_set;
struct nvme_tcp_ofld_queue *queues;
struct work_struct err_work;
struct delayed_work connect_work;
/*
* Each entry in the array indicates the number of queues of
* corresponding type.
*/
u32 io_queues[HCTX_MAX_TYPES];
/* Connectivity params */
struct nvme_tcp_ofld_ctrl_con_params conn_params;
struct nvme_tcp_ofld_req async_req;
/* Vendor specific driver context */
void *private_data;
};
struct nvme_tcp_ofld_ops {
const char *name;
struct module *module;
/* For vendor-specific driver to report what opts it supports.
* It could be different than the ULP supported opts due to hardware
* limitations. Also it could be different among different vendor
* drivers.
*/
int required_opts; /* bitmap using enum nvmf_parsing_opts */
int allowed_opts; /* bitmap using enum nvmf_parsing_opts */
/* For vendor-specific max num of segments and IO sizes */
u32 max_hw_sectors;
u32 max_segments;
/**
* claim_dev: Return True if addr is reachable via offload device.
* @dev: The offload device to check.
* @ctrl: The offload ctrl, which holds the conn_params field. The
* conn_params is to be filled with routing params by the lower
* driver.
*/
int (*claim_dev)(struct nvme_tcp_ofld_dev *dev,
struct nvme_tcp_ofld_ctrl *ctrl);
/**
* setup_ctrl: Setup device specific controller structures.
* @ctrl: The offload ctrl.
*/
int (*setup_ctrl)(struct nvme_tcp_ofld_ctrl *ctrl);
/**
* release_ctrl: Release/Free device specific controller structures.
* @ctrl: The offload ctrl.
*/
int (*release_ctrl)(struct nvme_tcp_ofld_ctrl *ctrl);
/**
* create_queue: Create offload queue and establish TCP + NVMeTCP
* (icreq+icresp) connection. Return true on successful connection.
* Based on nvme_tcp_alloc_queue.
* @queue: The queue itself - used as input and output.
* @qid: The queue ID associated with the requested queue.
* @queue_size: The queue depth.
*/
int (*create_queue)(struct nvme_tcp_ofld_queue *queue, int qid,
size_t queue_size);
/**
* drain_queue: Drain a given queue - blocking function call.
* Return from this function ensures that no additional
* completions will arrive on this queue and that the HW will
* not access host memory.
* @queue: The queue to drain.
*/
void (*drain_queue)(struct nvme_tcp_ofld_queue *queue);
/**
* destroy_queue: Close the TCP + NVMeTCP connection of a given queue
* and make sure it is no longer active (no completions will arrive on the
* queue).
* @queue: The queue to destroy.
*/
void (*destroy_queue)(struct nvme_tcp_ofld_queue *queue);
/**
* poll_queue: Poll a given queue for completions.
* @queue: The queue to poll.
*/
int (*poll_queue)(struct nvme_tcp_ofld_queue *queue);
/**
* send_req: Dispatch a request. Returns the execution status.
* @req: Ptr to request to be sent.
*/
int (*send_req)(struct nvme_tcp_ofld_req *req);
};
/* Exported functions for lower vendor specific offload drivers */
int nvme_tcp_ofld_register_dev(struct nvme_tcp_ofld_dev *dev);
void nvme_tcp_ofld_unregister_dev(struct nvme_tcp_ofld_dev *dev);
void nvme_tcp_ofld_error_recovery(struct nvme_ctrl *nctrl);
inline size_t nvme_tcp_ofld_inline_data_size(struct nvme_tcp_ofld_queue *queue);