Commit 143bdc2e authored by Daniel Borkmann's avatar Daniel Borkmann

Merge branch 'bpf-libbpf-af-xdp'

Magnus Karlsson says:

====================
This patch proposes to add AF_XDP support to libbpf. The main reason
for this is to facilitate writing applications that use AF_XDP by
offering higher-level APIs that hide many of the details of the AF_XDP
uapi. This is in the same vein as libbpf facilitates XDP adoption by
offering easy-to-use higher level interfaces of XDP
functionality. Hopefully this will facilitate adoption of AF_XDP, make
applications using it simpler and smaller, and finally also make it
possible for applications to benefit from optimizations in the AF_XDP
user space access code. Previously, people just copied and pasted the
code from the sample application into their application, which is not
desirable.

The proposed interface is composed of two parts:

* Low-level access interface to the four rings and the packet
* High-level control plane interface for creating and setting up umems
  and AF_XDP sockets. This interface also loads a simple XDP program
  that routes all traffic on a queue up to the AF_XDP socket.

The sample program has been updated to use this new interface and in
that process it lost roughly 300 lines of code. I cannot detect any
performance degradations due to the use of this library instead of the
previous functions that were inlined in the sample application. But I
did measure this on a slower machine and not the Broadwell that we
normally use.

The rings are now called xsk_ring and when a producer operates on
it. It is xsk_ring_prod and for a consumer it is xsk_ring_cons. This
way we can get some compile time error checking that the rings are
used correctly.

Comments and contenplations:

* The current behaviour is that the library loads an XDP program (if
  requested to do so) but the clean up of this program is left to the
  application. It would be possible to implement this cleanup in the
  library, but it would require state to be kept on netdev level,
  which there is none at the moment, and the synchronization of this
  between processes. All this adding complexity. But when we get an
  XDP program per queue id, then it becomes trivial to also remove the
  XDP program when the application exits. This proposal from Jesper,
  Björn and others will also improve the performance of libbpf, since
  most of the XDP program code can be removed when that feature is
  supported.

* In a future release, I am planning on adding a higher level data
  plane interface too. This will be based around recvmsg and sendmsg
  with the use of struct iovec for batching, without the user having
  to know anything about the underlying four rings of an AF_XDP
  socket. There will be one semantic difference though from the
  standard recvmsg and that is that the kernel will fill in the iovecs
  instead of the application. But the rest should be the same as the
  libc versions so that application writers feel at home.

Patch 1: adds AF_XDP support in libbpf
Patch 2: updates the xdpsock sample application to use the libbpf functions
Patch 3: Documentation update to help first time users

Changes v5 to v6:
  * Fixed prog_fd bug found by Xiaolong Ye. Thanks!
Changes v4 to v5:
  * Added a FAQ to the documentation
  * Removed xsk_umem__get_data and renamed xsk_umem__get_dat_raw to
    xsk_umem__get_data
  * Replaced the netlink code with bpf_get_link_xdp_id()
  * Dynamic allocation of the map sizes. They are now sized after
    the max number of queueus on the netdev in question.
Changes v3 to v4:
  * Dropped the pr_*() patch in favor of Yonghong Song's patch set
  * Addressed the review comments of Daniel Borkmann, mainly leaking
    of file descriptors at clean up and making the data plane APIs
    all static inline (with the exception of xsk_umem__get_data that
    uses an internal structure I do not want to expose).
  * Fixed the netlink callback as suggested by Maciej Fijalkowski.
  * Removed an unecessary include in the sample program as spotted by
    Ilia Fillipov.
Changes v2 to v3:
  * Added automatic loading of a simple XDP program that routes all
    traffic on a queue up to the AF_XDP socket. This program loading
    can be disabled.
  * Updated function names to be consistent with the libbpf naming
    convention
  * Moved all code to xsk.[ch]
  * Removed all the XDP program loading code from the sample since
    this is now done by libbpf
  * The initialization functions now return a handle as suggested by
    Alexei
  * const statements added in the API where applicable.
Changes v1 to v2:
  * Fixed cleanup of library state on error.
  * Moved API to initial version
  * Prefixed all public functions by xsk__ instead of xsk_
  * Added comment about changed default ring sizes, batch size and umem
    size in the sample application commit message
  * The library now only creates an Rx or Tx ring if the respective
    parameter is != NULL
====================
Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
parents 740f8a65 0f4a9b7d
......@@ -295,6 +295,41 @@ using::
For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.
FAQ
=======
Q: I am not seeing any traffic on the socket. What am I doing wrong?
A: When a netdev of a physical NIC is initialized, Linux usually
allocates one Rx and Tx queue pair per core. So on a 8 core system,
queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
bind call or the xsk_socket__create libbpf function call, you
specify a specific queue id to bind to and it is only the traffic
towards that queue you are going to get on you socket. So in the
example above, if you bind to queue 0, you are NOT going to get any
traffic that is distributed to queues 1 through 7. If you are
lucky, you will see the traffic, but usually it will end up on one
of the queues you have not bound to.
There are a number of ways to solve the problem of getting the
traffic you want to the queue id you bound to. If you want to see
all the traffic, you can force the netdev to only have 1 queue, queue
id 0, and then bind to queue 0. You can use ethtool to do this::
sudo ethtool -L <interface> combined 1
If you want to only see part of the traffic, you can program the
NIC through ethtool to filter out your traffic to a single queue id
that you can bind your XDP socket to. Here is one example in which
UDP traffic to and from port 4242 are sent to queue 2::
sudo ethtool -N <interface> rx-flow-hash udp4 fn
sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
4242 action 2
A number of other ways are possible all up to the capabilitites of
the NIC you have.
Credits
=======
......@@ -309,4 +344,3 @@ Credits
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn
......@@ -163,7 +163,6 @@ always += xdp2skb_meta_kern.o
always += syscall_tp_kern.o
always += cpustat_kern.o
always += xdp_adjust_tail_kern.o
always += xdpsock_kern.o
always += xdp_fwd_kern.o
always += task_fd_query_kern.o
always += xdp_sample_pkts_kern.o
......
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef XDPSOCK_H_
#define XDPSOCK_H_
/* Power-of-2 number of sockets */
#define MAX_SOCKS 4
/* Round-robin receive */
#define RR_LB 0
#endif /* XDPSOCK_H_ */
// SPDX-License-Identifier: GPL-2.0
#define KBUILD_MODNAME "foo"
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"
#include "xdpsock.h"
struct bpf_map_def SEC("maps") qidconf_map = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 1,
};
struct bpf_map_def SEC("maps") xsks_map = {
.type = BPF_MAP_TYPE_XSKMAP,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = MAX_SOCKS,
};
struct bpf_map_def SEC("maps") rr_map = {
.type = BPF_MAP_TYPE_PERCPU_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(unsigned int),
.max_entries = 1,
};
SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
int *qidconf, key = 0, idx;
unsigned int *rr;
qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
if (!qidconf)
return XDP_ABORTED;
if (*qidconf != ctx->rx_queue_index)
return XDP_PASS;
#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
rr = bpf_map_lookup_elem(&rr_map, &key);
if (!rr)
return XDP_ABORTED;
*rr = (*rr + 1) & (MAX_SOCKS - 1);
idx = *rr;
#else
idx = 0;
#endif
return bpf_redirect_map(&xsks_map, idx, 0);
}
char _license[] SEC("license") = "GPL";
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
/*
* ethtool.h: Defines for Linux ethtool.
*
* Copyright (C) 1998 David S. Miller (davem@redhat.com)
* Copyright 2001 Jeff Garzik <jgarzik@pobox.com>
* Portions Copyright 2001 Sun Microsystems (thockin@sun.com)
* Portions Copyright 2002 Intel (eli.kupermann@intel.com,
* christopher.leech@intel.com,
* scott.feldman@intel.com)
* Portions Copyright (C) Sun Microsystems 2008
*/
#ifndef _UAPI_LINUX_ETHTOOL_H
#define _UAPI_LINUX_ETHTOOL_H
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/if_ether.h>
#define ETHTOOL_GCHANNELS 0x0000003c /* Get no of channels */
/**
* struct ethtool_channels - configuring number of network channel
* @cmd: ETHTOOL_{G,S}CHANNELS
* @max_rx: Read only. Maximum number of receive channel the driver support.
* @max_tx: Read only. Maximum number of transmit channel the driver support.
* @max_other: Read only. Maximum number of other channel the driver support.
* @max_combined: Read only. Maximum number of combined channel the driver
* support. Set of queues RX, TX or other.
* @rx_count: Valid values are in the range 1 to the max_rx.
* @tx_count: Valid values are in the range 1 to the max_tx.
* @other_count: Valid values are in the range 1 to the max_other.
* @combined_count: Valid values are in the range 1 to the max_combined.
*
* This can be used to configure RX, TX and other channels.
*/
struct ethtool_channels {
__u32 cmd;
__u32 max_rx;
__u32 max_tx;
__u32 max_other;
__u32 max_combined;
__u32 rx_count;
__u32 tx_count;
__u32 other_count;
__u32 combined_count;
};
#endif /* _UAPI_LINUX_ETHTOOL_H */
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
/*
* if_xdp: XDP socket user-space interface
* Copyright(c) 2018 Intel Corporation.
*
* Author(s): Björn Töpel <bjorn.topel@intel.com>
* Magnus Karlsson <magnus.karlsson@intel.com>
*/
#ifndef _LINUX_IF_XDP_H
#define _LINUX_IF_XDP_H
#include <linux/types.h>
/* Options for the sxdp_flags field */
#define XDP_SHARED_UMEM (1 << 0)
#define XDP_COPY (1 << 1) /* Force copy-mode */
#define XDP_ZEROCOPY (1 << 2) /* Force zero-copy mode */
struct sockaddr_xdp {
__u16 sxdp_family;
__u16 sxdp_flags;
__u32 sxdp_ifindex;
__u32 sxdp_queue_id;
__u32 sxdp_shared_umem_fd;
};
struct xdp_ring_offset {
__u64 producer;
__u64 consumer;
__u64 desc;
};
struct xdp_mmap_offsets {
struct xdp_ring_offset rx;
struct xdp_ring_offset tx;
struct xdp_ring_offset fr; /* Fill */
struct xdp_ring_offset cr; /* Completion */
};
/* XDP socket options */
#define XDP_MMAP_OFFSETS 1
#define XDP_RX_RING 2
#define XDP_TX_RING 3
#define XDP_UMEM_REG 4
#define XDP_UMEM_FILL_RING 5
#define XDP_UMEM_COMPLETION_RING 6
#define XDP_STATISTICS 7
struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */
__u64 len; /* Length of packet data area */
__u32 chunk_size;
__u32 headroom;
};
struct xdp_statistics {
__u64 rx_dropped; /* Dropped for reasons other than invalid desc */
__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
};
/* Pgoff for mmaping the rings */
#define XDP_PGOFF_RX_RING 0
#define XDP_PGOFF_TX_RING 0x80000000
#define XDP_UMEM_PGOFF_FILL_RING 0x100000000ULL
#define XDP_UMEM_PGOFF_COMPLETION_RING 0x180000000ULL
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
__u32 len;
__u32 options;
};
/* UMEM descriptor is __u64 */
#endif /* _LINUX_IF_XDP_H */
libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o netlink.o bpf_prog_linfo.o libbpf_probes.o
libbpf-y := libbpf.o bpf.o nlattr.o btf.o libbpf_errno.o str_error.o netlink.o bpf_prog_linfo.o libbpf_probes.o xsk.o
......@@ -164,6 +164,9 @@ $(BPF_IN): force elfdep bpfdep
@(test -f ../../include/uapi/linux/if_link.h -a -f ../../../include/uapi/linux/if_link.h && ( \
(diff -B ../../include/uapi/linux/if_link.h ../../../include/uapi/linux/if_link.h >/dev/null) || \
echo "Warning: Kernel ABI header at 'tools/include/uapi/linux/if_link.h' differs from latest version at 'include/uapi/linux/if_link.h'" >&2 )) || true
@(test -f ../../include/uapi/linux/if_xdp.h -a -f ../../../include/uapi/linux/if_xdp.h && ( \
(diff -B ../../include/uapi/linux/if_xdp.h ../../../include/uapi/linux/if_xdp.h >/dev/null) || \
echo "Warning: Kernel ABI header at 'tools/include/uapi/linux/if_xdp.h' differs from latest version at 'include/uapi/linux/if_xdp.h'" >&2 )) || true
$(Q)$(MAKE) $(build)=libbpf
$(OUTPUT)libbpf.so: $(BPF_IN)
......@@ -174,7 +177,7 @@ $(OUTPUT)libbpf.a: $(BPF_IN)
$(QUIET_LINK)$(RM) $@; $(AR) rcs $@ $^
$(OUTPUT)test_libbpf: test_libbpf.cpp $(OUTPUT)libbpf.a
$(QUIET_LINK)$(CXX) $^ -lelf -o $@
$(QUIET_LINK)$(CXX) $(INCLUDES) $^ -lelf -o $@
check: check_abi
......
......@@ -9,7 +9,7 @@ described here. It's recommended to follow these conventions whenever a
new function or type is added to keep libbpf API clean and consistent.
All types and functions provided by libbpf API should have one of the
following prefixes: ``bpf_``, ``btf_``, ``libbpf_``.
following prefixes: ``bpf_``, ``btf_``, ``libbpf_``, ``xsk_``.
System call wrappers
--------------------
......@@ -62,6 +62,19 @@ Auxiliary functions and types that don't fit well in any of categories
described above should have ``libbpf_`` prefix, e.g.
``libbpf_get_error`` or ``libbpf_prog_type_by_name``.
AF_XDP functions
-------------------
AF_XDP functions should have an ``xsk_`` prefix, e.g.
``xsk_umem__get_data`` or ``xsk_umem__create``. The interface consists
of both low-level ring access functions and high-level configuration
functions. These can be mixed and matched. Note that these functions
are not reentrant for performance reasons.
Please take a look at Documentation/networking/af_xdp.rst in the Linux
kernel source tree on how to use XDP sockets and for some common
mistakes in case you do not get any traffic up to user space.
libbpf ABI
==========
......
......@@ -147,4 +147,10 @@ LIBBPF_0.0.2 {
btf_ext__new;
btf_ext__reloc_func_info;
btf_ext__reloc_line_info;
xsk_umem__create;
xsk_socket__create;
xsk_umem__delete;
xsk_socket__delete;
xsk_umem__fd;
xsk_socket__fd;
} LIBBPF_0.0.1;
This diff is collapsed.
/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
/*
* AF_XDP user-space access library.
*
* Copyright(c) 2018 - 2019 Intel Corporation.
*
* Author(s): Magnus Karlsson <magnus.karlsson@intel.com>
*/
#ifndef __LIBBPF_XSK_H
#define __LIBBPF_XSK_H
#include <stdio.h>
#include <stdint.h>
#include <linux/if_xdp.h>
#include "libbpf.h"
#ifdef __cplusplus
extern "C" {
#endif
/* Do not access these members directly. Use the functions below. */
#define DEFINE_XSK_RING(name) \
struct name { \
__u32 cached_prod; \
__u32 cached_cons; \
__u32 mask; \
__u32 size; \
__u32 *producer; \
__u32 *consumer; \
void *ring; \
}
DEFINE_XSK_RING(xsk_ring_prod);
DEFINE_XSK_RING(xsk_ring_cons);
struct xsk_umem;
struct xsk_socket;
static inline __u64 *xsk_ring_prod__fill_addr(struct xsk_ring_prod *fill,
__u32 idx)
{
__u64 *addrs = (__u64 *)fill->ring;
return &addrs[idx & fill->mask];
}
static inline const __u64 *
xsk_ring_cons__comp_addr(const struct xsk_ring_cons *comp, __u32 idx)
{
const __u64 *addrs = (const __u64 *)comp->ring;
return &addrs[idx & comp->mask];
}
static inline struct xdp_desc *xsk_ring_prod__tx_desc(struct xsk_ring_prod *tx,
__u32 idx)
{
struct xdp_desc *descs = (struct xdp_desc *)tx->ring;
return &descs[idx & tx->mask];
}
static inline const struct xdp_desc *
xsk_ring_cons__rx_desc(const struct xsk_ring_cons *rx, __u32 idx)
{
const struct xdp_desc *descs = (const struct xdp_desc *)rx->ring;
return &descs[idx & rx->mask];
}
static inline __u32 xsk_prod_nb_free(struct xsk_ring_prod *r, __u32 nb)
{
__u32 free_entries = r->cached_cons - r->cached_prod;
if (free_entries >= nb)
return free_entries;
/* Refresh the local tail pointer.
* cached_cons is r->size bigger than the real consumer pointer so
* that this addition can be avoided in the more frequently
* executed code that computs free_entries in the beginning of
* this function. Without this optimization it whould have been
* free_entries = r->cached_prod - r->cached_cons + r->size.
*/
r->cached_cons = *r->consumer + r->size;
return r->cached_cons - r->cached_prod;
}
static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
{
__u32 entries = r->cached_prod - r->cached_cons;
if (entries == 0) {
r->cached_prod = *r->producer;
entries = r->cached_prod - r->cached_cons;
}
return (entries > nb) ? nb : entries;
}
static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
size_t nb, __u32 *idx)
{
if (unlikely(xsk_prod_nb_free(prod, nb) < nb))
return 0;
*idx = prod->cached_prod;
prod->cached_prod += nb;
return nb;
}
static inline void xsk_ring_prod__submit(struct xsk_ring_prod *prod, size_t nb)
{
/* Make sure everything has been written to the ring before signalling
* this to the kernel.
*/
smp_wmb();
*prod->producer += nb;
}
static inline size_t xsk_ring_cons__peek(struct xsk_ring_cons *cons,
size_t nb, __u32 *idx)
{
size_t entries = xsk_cons_nb_avail(cons, nb);
if (likely(entries > 0)) {
/* Make sure we do not speculatively read the data before
* we have received the packet buffers from the ring.
*/
smp_rmb();
*idx = cons->cached_cons;
cons->cached_cons += entries;
}
return entries;
}
static inline void xsk_ring_cons__release(struct xsk_ring_cons *cons, size_t nb)
{
*cons->consumer += nb;
}
static inline void *xsk_umem__get_data(void *umem_area, __u64 addr)
{
return &((char *)umem_area)[addr];
}
LIBBPF_API int xsk_umem__fd(const struct xsk_umem *umem);
LIBBPF_API int xsk_socket__fd(const struct xsk_socket *xsk);
#define XSK_RING_CONS__DEFAULT_NUM_DESCS 2048
#define XSK_RING_PROD__DEFAULT_NUM_DESCS 2048
#define XSK_UMEM__DEFAULT_FRAME_SHIFT 11 /* 2048 bytes */
#define XSK_UMEM__DEFAULT_FRAME_SIZE (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
#define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
struct xsk_umem_config {
__u32 fill_size;
__u32 comp_size;
__u32 frame_size;
__u32 frame_headroom;
};
/* Flags for the libbpf_flags field. */
#define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
struct xsk_socket_config {
__u32 rx_size;
__u32 tx_size;
__u32 libbpf_flags;
__u32 xdp_flags;
__u16 bind_flags;
};
/* Set config to NULL to get the default configuration. */
LIBBPF_API int xsk_umem__create(struct xsk_umem **umem,
void *umem_area, __u64 size,
struct xsk_ring_prod *fill,
struct xsk_ring_cons *comp,
const struct xsk_umem_config *config);
LIBBPF_API int xsk_socket__create(struct xsk_socket **xsk,
const char *ifname, __u32 queue_id,
struct xsk_umem *umem,
struct xsk_ring_cons *rx,
struct xsk_ring_prod *tx,
const struct xsk_socket_config *config);
/* Returns 0 for success and -EBUSY if the umem is still in use. */
LIBBPF_API int xsk_umem__delete(struct xsk_umem *umem);
LIBBPF_API void xsk_socket__delete(struct xsk_socket *xsk);
#ifdef __cplusplus
} /* extern "C" */
#endif
#endif /* __LIBBPF_XSK_H */
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment