Commits · d3d854fd6a1d97157f790604e07f6386e8df8fe4 · Kirill Smelkov / linux

03 Feb, 2023 7 commits

netdev-genl: create a simple family for netdev stuff · d3d854fd

Jakub Kicinski authored Feb 01, 2023

Add a Netlink spec-compatible family for netdevs.
This is a very simple implementation without much
thought going into it.

It allows us to reap all the benefits of Netlink specs,
one can use the generic client to issue the commands:

  $ ./cli.py --spec netdev.yaml --dump dev_get
  [{'ifindex': 1, 'xdp-features': set()},
   {'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}},
   {'ifindex': 3, 'xdp-features': {'rx-sg'}}]

the generic python library does not have flags-by-name
support, yet, but we also don't have to carry strings
in the messages, as user space can get the names from
the spec.
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

d3d854fd

selftests/bpf: Use semicolon instead of comma in test_verifier.c · 15080908

Tiezhu Yang authored Feb 02, 2023

Just silence the following checkpatch warning:

WARNING: Possible comma where semicolon could be used
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Link: https://lore.kernel.org/r/1675319486-27744-3-git-send-email-yangtiezhu@loongson.cnSigned-off-by: Alexei Starovoitov <ast@kernel.org>

15080908

tools/bpf: Use tab instead of white spaces to sync bpf.h · e2bd9742

Tiezhu Yang authored Feb 02, 2023

Just silence the following build warning:

Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Link: https://lore.kernel.org/r/1675319486-27744-2-git-send-email-yangtiezhu@loongson.cnSigned-off-by: Alexei Starovoitov <ast@kernel.org>

e2bd9742

bpf: Drop always true do_idr_lock parameter to bpf_map_free_id · 158e5e9e

Tobias Klauser authored Feb 02, 2023

The do_idr_lock parameter to bpf_map_free_id was introduced by commit
bd5f5f4e ("bpf: Add BPF_MAP_GET_FD_BY_ID"). However, all callers set
do_idr_lock = true since commit 1e0bd5a0 ("bpf: Switch bpf_map ref
counter to atomic64_t so bpf_map_inc() never fails").

While at it also inline __bpf_map_put into its only caller bpf_map_put
now that do_idr_lock can be dropped from its signature.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Link: https://lore.kernel.org/r/20230202141921.4424-1-tklauser@distanz.chSigned-off-by: Alexei Starovoitov <ast@kernel.org>

158e5e9e

Merge branch ' docs/bpf: Add description of register liveness tracking algorithm' · d9e44c32

Alexei Starovoitov authored Feb 02, 2023

Eduard Zingerman says:

====================

An overview of the register tracking liveness algorithm.
Previous versions posted here: [1], [2], [3].
- Changes from RFC to v2 (suggested by Andrii Nakryiko):
  - wording corrected to use term "stack slot" instead of "stack spill";
  - parentage chain diagram updated to show nil links for frame #1;
  - added example for non-BPF_DW writes behavior;
  - explanation in "Read marks propagation for cache hits" is reworked.
- Changes from v2 to v3:
  - lot's of grammatical / wording fixes as suggested by David Vernet;
  - "Register parentage chains" section is fixed to reflect what
    happens to r1-r5 when function call is processed (as suggested by David and Alexei);
  - Example in "Liveness marks tracking" section updated to explain
    why partial writes should not lead to REG_LIVE_WRITTEN marks
    (suggested by David);
  - "Read marks propagation for cache hits" section updates:
    - Explanation updated to hint why read marks should be propagated
      before jumping to example (suggested by David);
    - Removed box around B/D in the diagram updated (suggested by Alexei).
- Changes from v3 to v4 (suggested by Edward Cree):
  - register parentage chain diagram updated to explain why r6 mark is
    not propagated;
  - read mark propagation algorithm pseudo-code fixed to correctly
    show "if state->live & REG_LIVE_WRITTEN" stop condition;
  - general wording improvements in section "Liveness marks tracking".

[1] https://lore.kernel.org/bpf/20230124220343.2942203-1-eddyz87@gmail.com/
[2] https://lore.kernel.org/bpf/20230130182400.630997-1-eddyz87@gmail.com/
[3] https://lore.kernel.org/bpf/20230131181118.733845-1-eddyz87@gmail.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d9e44c32

docs/bpf: Add description of register liveness tracking algorithm · cb601848

Eduard Zingerman authored Feb 02, 2023

This is a followup for [1], adds an overview for the register liveness
tracking, covers the following points:
- why register liveness tracking is useful;
- how register parentage chains are constructed;
- how liveness marks are applied using the parentage chains.

[1] https://lore.kernel.org/bpf/CAADnVQKs2i1iuZ5SUGuJtxWVfGYR9kDgYKhq3rNV+kBLQCu7rA@mail.gmail.com/Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20230202125713.821931-2-eddyz87@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

cb601848

selftests/bpf: Initialize tc in xdp_synproxy · 354bb4a0

Ilya Leoshkevich authored Feb 03, 2023

xdp_synproxy/xdp fails in CI with:

Error: bpf_tc_hook_create: File exists

The XDP version of the test should not be calling bpf_tc_hook_create();
the reason it's happening anyway is that if we don't specify --tc on the
command line, tc variable remains uninitialized.

Fixes: 784d5dc0 ("selftests/bpf: Add selftests for raw syncookie helpers in TC mode")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Reported-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230202235335.3403781-1-iii@linux.ibm.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

354bb4a0

02 Feb, 2023 2 commits

selftests/bpf: Remove duplicate include header in xdp_hw_metadata · 4bc32df7

Ye Xingchen authored Jan 31, 2023

The linux/net_tstamp.h is included more than once, thus clean it up.
Signed-off-by: Ye Xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/202301311440516312161@zte.com.cn

4bc32df7

selftests/bpf: Don't refill on completion in xdp_metadata · 8b79b34a

Stanislav Fomichev authored Feb 01, 2023

We only need to consume TX completion instead of refilling 'fill' ring.
It's currently not an issue because we never RX more than 8 packets.

Fixes: e2a46d54 ("selftests/bpf: Verify xdp_metadata xdp->af_xdp path")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230201233640.367646-1-sdf@google.com

8b79b34a

01 Feb, 2023 24 commits

selftests/bpf: xdp_hw_metadata use strncpy for ifname · e8a3c8bd

Jesper Dangaard Brouer authored Feb 01, 2023

The ifname char pointer is taken directly from the command line
as input and the string is copied directly into struct ifreq
via strcpy. This makes it easy to corrupt other members of ifreq
and generally do stack overflows.

Most often the ioctl will fail with:

 ./xdp_hw_metadata: ioctl(SIOCETHTOOL): Bad address

As people will likely copy-paste code for getting NIC queue
channels (rxq_num) and enabling HW timestamping (hwtstamp_ioctl)
lets make this code a bit more secure by using strncpy.

Fixes: 297a3f12 ("selftests/bpf: Simple program to dump XDP RX metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/167527272543.937063.16993147790832546209.stgit@firesoul

e8a3c8bd

selftests/bpf: xdp_hw_metadata correct status value in error(3) · 7bd4224d

Jesper Dangaard Brouer authored Feb 01, 2023

The glibc error reporting function error():

 void error(int status, int errnum, const char *format, ...);

The status argument should be a positive value between 0-255 as it
is passed over to the exit(3) function as the value as the shell exit
status. The least significant byte of status (i.e., status & 0xFF) is
returned to the shell parent.

Fix this by using 1 instead of -1. As 1 corresponds to C standard
constant EXIT_FAILURE.

Fixes: 297a3f12 ("selftests/bpf: Simple program to dump XDP RX metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/167527272038.937063.9137108142012298120.stgit@firesoul

7bd4224d

selftests/bpf: xdp_hw_metadata cleanup cause segfault · a19a62e5

Jesper Dangaard Brouer authored Feb 01, 2023

Using xdp_hw_metadata I experince Segmentation fault after seeing
"detaching bpf program....".

On my system the segfault happened when accessing bpf_obj->skeleton
in xdp_hw_metadata__destroy(bpf_obj) call. That doesn't make any sense
as this memory have not been freed by program at this point in time.

Prior to calling xdp_hw_metadata__destroy(bpf_obj) the function
close_xsk() is called for each RX-queue xsk.  The real bug lays
in close_xsk() that unmap via munmap() the wrong memory pointer.
The call xsk_umem__delete(xsk->umem) will free xsk->umem, thus
the call to munmap(xsk->umem, UMEM_SIZE) will have unpredictable
behavior. And man page explain subsequent references to these
pages will generate SIGSEGV.

Unmapping xsk->umem_area instead removes the segfault.

Fixes: 297a3f12 ("selftests/bpf: Simple program to dump XDP RX metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/167527271533.937063.5717065138099679142.stgit@firesoul

a19a62e5

selftests/bpf: xdp_hw_metadata clear metadata when -EOPNOTSUPP · 3fd9dcd6

Jesper Dangaard Brouer authored Feb 01, 2023

The AF_XDP userspace part of xdp_hw_metadata see non-zero as a signal of
the availability of rx_timestamp and rx_hash in data_meta area. The
kernel-side BPF-prog code doesn't initialize these members when kernel
returns an error e.g. -EOPNOTSUPP.  This memory area is not guaranteed to
be zeroed, and can contain garbage/previous values, which will be read
and interpreted by AF_XDP userspace side.

Tested this on different drivers. The experiences are that for most
packets they will have zeroed this data_meta area, but occasionally it
will contain garbage data.

Example of failure tested on ixgbe:

 poll: 1 (0)
 xsk_ring_cons__peek: 1
 0x18ec788: rx_desc[0]->addr=100000000008000 addr=8100 comp_addr=8000
 rx_hash: 3697961069
 rx_timestamp:  9024981991734834796 (sec:9024981991.7348)
 0x18ec788: complete idx=8 addr=8000

Converting to date:

 date -d @9024981991
 2255-12-28T20:26:31 CET

I choose a simple fix in this patch. When kfunc fails or isn't supported
assign zero to the corresponding struct meta value.

It's up to the individual BPF-programmer to do something smarter e.g.
that fits their use-case, like getting a software timestamp and marking
a flag that gives the type of timestamp.

Fixes: 297a3f12 ("selftests/bpf: Simple program to dump XDP RX metadata")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/167527271027.937063.5177725618616476592.stgit@firesoul

3fd9dcd6

selftests/bpf: Fix unmap bug in prog_tests/xdp_metadata.c · f2922f77

Jesper Dangaard Brouer authored Feb 01, 2023

The function close_xsk() unmap via munmap() the wrong memory pointer.
The call xsk_umem__delete(xsk->umem) have already freed xsk->umem.
Thus the call to munmap(xsk->umem, UMEM_SIZE) will have unpredictable
behavior that can lead to Segmentation fault elsewhere, as man page
explain subsequent references to these pages will generate SIGSEGV.

Fixes: e2a46d54 ("selftests/bpf: Verify xdp_metadata xdp->af_xdp path")
Reported-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/167527517464.938135.13750760520577765269.stgit@firesoul

f2922f77

Merge branch 'kfunc-annotation' · 36b0fb13

Daniel Borkmann authored Feb 02, 2023

David Vernet says:

====================
This is v3 of the patchset [0]. v2 can be found at [1].

[0]: https://lore.kernel.org/bpf/Y7kCsjBZ%2FFrsWW%2Fe@maniforge.lan/T/
[1]: https://lore.kernel.org/lkml/20230123171506.71995-1-void@manifault.com/

Changelog:
----------
v2 -> v3:
- Go back to the __bpf_kfunc approach from v1. The BPF_KFUNC macro
  received pushback as it didn't match the more typical EXPORT_SYMBOL*
  APIs used elsewhere in the kernel. It's the longer term plan, but for
  now we're proposing something less controversial to fix kfuncs and BTF
  encoding.
- Add __bpf_kfunc macro to newly added cpumask kfuncs.
- Add __bpf_kfunc macro to newly added XDP metadata kfuncs, which were
  failing to be BTF encoded in the thread in [2].
- Update patch description(s) to reference the discussions in [2].
- Add a selftest that validates that a static kfunc with unused args is
  properly BTF encoded and can be invoked.

[2]: https://lore.kernel.org/all/fe5d42d1-faad-d05e-99ad-1c2c04776950@oracle.com/

v1 -> v2:
- Wrap entire function signature in BPF_KFUNC macro instead of using
  __bpf_kfunc tag (Kumar)
- Update all kfunc definitions to use this macro.
- Update kfuncs.rst documentation to describe and illustrate the macro.
- Also clean up a few small parts of kfuncs.rst, e.g. some grammar, and
  in general making it a bit tighter.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

36b0fb13

selftests/bpf: Add testcase for static kfunc with unused arg · 6aed15e3

David Vernet authored Feb 01, 2023

kfuncs are allowed to be static, or not use one or more of their
arguments. For example, bpf_xdp_metadata_rx_hash() in net/core/xdp.c is
meant to be implemented by drivers, with the default implementation just
returning -EOPNOTSUPP. As described in [0], such kfuncs can have their
arguments elided, which can cause BTF encoding to be skipped. The new
__bpf_kfunc macro should address this, and this patch adds a selftest
which verifies that a static kfunc with at least one unused argument can
still be encoded and invoked by a BPF program.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230201173016.342758-5-void@manifault.com

6aed15e3

bpf: Add __bpf_kfunc tag to all kfuncs · 400031e0

David Vernet authored Feb 01, 2023

Now that we have the __bpf_kfunc tag, we should use add it to all
existing kfuncs to ensure that they'll never be elided in LTO builds.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com

400031e0

bpf: Document usage of the new __bpf_kfunc macro · 98e6ab7a

David Vernet authored Feb 01, 2023

Now that the __bpf_kfunc macro has been added to linux/btf.h, include a
blurb about it in the kfuncs.rst file. In order for the macro to
successfully render with .. kernel-doc, we'll also need to add it to the
c_id_attributes array.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-3-void@manifault.com

98e6ab7a

bpf: Add __bpf_kfunc tag for marking kernel functions as kfuncs · 57e7c169

David Vernet authored Feb 01, 2023

kfuncs are functions defined in the kernel, which may be invoked by BPF
programs. They may or may not also be used as regular kernel functions,
implying that they may be static (in which case the compiler could e.g.
inline it away, or elide one or more arguments), or it could have
external linkage, but potentially be elided in an LTO build if a
function is observed to never be used, and is stripped from the final
kernel binary.

This has already resulted in some issues, such as those discussed in [0]
wherein changes in DWARF that identify when a parameter has been
optimized out can break BTF encodings (and in general break the kfunc).

[0]: https://lore.kernel.org/all/1675088985-20300-2-git-send-email-alan.maguire@oracle.com/

We therefore require some convenience macro that kfunc developers can
use just add to their kfuncs, and which will prevent all of the above
issues from happening. This is in contrast with what we have today,
where some kfunc definitions have "noinline", some have "__used", and
others are static and have neither.

Note that longer term, this mechanism may be replaced by a macro that
more closely resembles EXPORT_SYMBOL_GPL(), as described in [1]. For
now, we're going with this shorter-term approach to fix existing issues
in kfuncs.

[1]: https://lore.kernel.org/lkml/Y9AFT4pTydKh+PD3@maniforge.lan/

Note as well that checkpatch complains about this patch with the
following:

ERROR: Macros with complex values should be enclosed in parentheses
+#define __bpf_kfunc __used noinline

There seems to be a precedent for using this pattern in other places
such as compiler_types.h (see e.g. __randomize_layout and noinstr), so
it seems appropriate.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-2-void@manifault.com

57e7c169

Merge branch 'xdp-ice-mbuf' · 10d1b0e4

Daniel Borkmann authored Feb 01, 2023

Maciej Fijalkowski says:

====================
Although this work started as an effort to add multi-buffer XDP support
to ice driver, as usual it turned out that some other side stuff needed
to be addressed, so let me give you an overview.

First patch adjusts legacy-rx in a way that it will be possible to refer
to skb_shared_info being at the end of the buffer when gathering up
frame fragments within xdp_buff.

Then, patches 2-9 prepare ice driver in a way that actual multi-buffer
patches will be easier to swallow.

10 and 11 are the meat. What is worth mentioning is that this set
actually *fixes* things as patch 11 removes the logic based on
next_dd/rs and we previously stepped away from this for ice_xmit_zc().
Currently, AF_XDP ZC XDP_TX workload is off as there are two cleaning
sides that can be triggered and two of them work on different internal
logic. This set unifies that and allows us to improve the performance by
2x with a trick on the last (13) patch.

12th is a simple cleanup of no longer fields from Tx ring.

I might be wrong but I have not seen anyone reporting performance impact
among patches that add XDP multi-buffer support to a particular driver.
Numbers below were gathered via xdp_rxq_info and xdp_redirect_map on
1500 MTU:

XDP_DROP      +1%
XDP_PASS      -1,2%
XDP_TX        -0,5%
XDP_REDIRECT  -3,3%

Cherry on top, which is not directly related to mbuf support (last
patch):

XDP_TX ZC +126%

Target the we agreed on was to not degrade performance for any action by
anything that would be over 5%, so our goal was met. Basically this set
keeps the performance where it was. Redirect is slower due to more
frequent tail bumps.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>

10d1b0e4

ice: xsk: Do not convert to buff to frame for XDP_TX · a24b4c6e

Maciej Fijalkowski authored Jan 31, 2023

Let us store pointer to xdp_buff that came from xsk_buff_pool on tx_buf
so that it will be possible to recycle it via xsk_buff_free() on Tx
cleaning side. This way it is not necessary to do expensive copy to
another xdp_buff backed by a newly allocated page.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-14-maciej.fijalkowski@intel.com

a24b4c6e

ice: Remove next_{dd,rs} fields from ice_tx_ring · f4db7b31

Maciej Fijalkowski authored Jan 31, 2023

Now that both ZC and standard XDP data paths stopped using Tx logic
based on next_dd and next_rs fields, we can safely remove these fields
and shrink Tx ring structure.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-13-maciej.fijalkowski@intel.com

f4db7b31

ice: Add support for XDP multi-buffer on Tx side · 3246a107

Maciej Fijalkowski authored Jan 31, 2023

Similarly as for Rx side in previous patch, logic on XDP Tx in ice
driver needs to be adjusted for multi-buffer support. Specifically, the
way how HW Tx descriptors are produced and cleaned.

Currently, XDP_TX works on strict ring boundaries, meaning it sets RS
bit (on producer side) / looks up DD bit (on consumer/cleaning side)
every quarter of the ring. It means that if for example multi buffer
frame would span across the ring quarter boundary (say that frame
consists of 4 frames and we start from 62 descriptor where ring is sized
to 256 entries), RS bit would be produced in the middle of multi buffer
frame, which would be a broken behavior as it needs to be set on the
last descriptor of the frame.

To make it work, set RS bit at the last descriptor from the batch of
frames that XDP_TX action was used on and make the first entry remember
the index of last descriptor with RS bit set. This way, cleaning side
can take the index of descriptor with RS bit, look up DD bit's presence
and clean from first entry to last.

In order to clean up the code base introduce the common ice_set_rs_bit()
which will return index of descriptor that got RS bit produced on so
that standard driver can store this within proper ice_tx_buf and ZC
driver can simply ignore return value.
Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com

3246a107

ice: Add support for XDP multi-buffer on Rx side · 2fba7dc5

Maciej Fijalkowski authored Jan 31, 2023

Ice driver needs to be a bit reworked on Rx data path in order to
support multi-buffer XDP. For skb path, it currently works in a way that
Rx ring carries pointer to skb so if driver didn't manage to combine
fragmented frame at current NAPI instance, it can restore the state on
next instance and keep looking for last fragment (so descriptor with EOP
bit set). What needs to be achieved is that xdp_buff needs to be
combined in such way (linear + frags part) in the first place. Then skb
will be ready to go in case of XDP_PASS or BPF program being not present
on interface. If BPF program is there, it would work on multi-buffer
XDP. At this point xdp_buff resides directly on Rx ring, so given the
fact that skb will be built straight from xdp_buff, there will be no
further need to carry skb on Rx ring.

Besides removing skb pointer from Rx ring, lots of members have been
moved around within ice_rx_ring. First and foremost reason was to place
rx_buf with xdp_buff on the same cacheline. This means that once we
touch rx_buf (which is a preceding step before touching xdp_buff),
xdp_buff will already be hot in cache. Second thing was that xdp_rxq is
used rather rarely and it occupies a separate cacheline, so maybe it is
better to have it at the end of ice_rx_ring.

Other change that affects ice_rx_ring is the introduction of
ice_rx_ring::first_desc. Its purpose is twofold - first is to propagate
rx_buf->act to all the parts of current xdp_buff after running XDP
program, so that ice_put_rx_buf() that got moved out of the main Rx
processing loop will be able to tak an appriopriate action on each
buffer. Second is for ice_construct_skb().

ice_construct_skb() has a copybreak mechanism which had an explicit
impact on xdp_buff->skb conversion in the new approach when legacy Rx
flag is toggled. It works in a way that linear part is 256 bytes long,
if frame is bigger than that, remaining bytes are going as a frag to
skb_shared_info.

This means while memcpying frags from xdp_buff to newly allocated skb,
care needs to be taken when picking the destination frag array entry.
Upon the time ice_construct_skb() is called, when dealing with
fragmented frame, current rx_buf points to the *last* fragment, but
copybreak needs to be done against the first one. That's where
ice_rx_ring::first_desc helps.

When frame building spans across NAPI polls (DD bit is not set on
current descriptor and xdp->data is not NULL) with current Rx buffer
handling state there might be some problems.
Since calls to ice_put_rx_buf() were pulled out of the main Rx
processing loop and were scoped from cached_ntc to current ntc, remember
that now mentioned function relies on rx_buf->act, which is set within
ice_run_xdp(). ice_run_xdp() is called when EOP bit was found, so
currently we could put Rx buffer with rx_buf->act being *uninitialized*.
To address this, change scoping to rely on first_desc on both boundaries
instead.

This also implies that cleaned_count which is used as an input to
ice_alloc_rx_buffers() and tells how many new buffers should be refilled
has to be adjusted. If it stayed as is, what could happen is a case
where ntc would go over ntu.

Therefore, remove cleaned_count altogether and use against allocing
routine newly introduced ICE_RX_DESC_UNUSED() macro which is an
equivalent of ICE_DESC_UNUSED() dedicated for Rx side and based on
struct ice_rx_ring::first_desc instead of next_to_clean.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-11-maciej.fijalkowski@intel.com

2fba7dc5

ice: Use xdp->frame_sz instead of recalculating truesize · 8a11b334

Maciej Fijalkowski authored Jan 31, 2023

SKB path calculates truesize on three different functions, which could
be avoided as xdp_buff carries the already calculated truesize under
xdp_buff::frame_sz. If ice_add_rx_frag() is adjusted to take the
xdp_buff as an input just like functions responsible for creating
sk_buff initially, codebase could be simplified by removing these
redundant recalculations and rely on xdp_buff::frame_sz instead.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-10-maciej.fijalkowski@intel.com

8a11b334

ice: Do not call ice_finalize_xdp_rx() unnecessarily · 9070fe3d

Maciej Fijalkowski authored Jan 31, 2023

Currently ice_finalize_xdp_rx() is called only when xdp_prog is present
on VSI, which is a good thing. However, this optimization can be
enhanced and check only if any of the XDP_TX/XDP_REDIRECT took place in
current Rx processing. Non-zero value of @xdp_xmit indicates that
xdp_prog is present on VSI. This way XDP_DROP-based workloads will not
suffer from unnecessary calls to external function.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-9-maciej.fijalkowski@intel.com

9070fe3d

ice: Use ice_max_xdp_frame_size() in ice_xdp_setup_prog() · 60bc72b3

Maciej Fijalkowski authored Jan 31, 2023

This should have been used in there from day 1, let us address that
before introducing XDP multi-buffer support for Rx side.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-8-maciej.fijalkowski@intel.com

60bc72b3

ice: Centrallize Rx buffer recycling · 1dc1a7e7

Maciej Fijalkowski authored Jan 31, 2023

Currently calls to ice_put_rx_buf() are sprinkled through
ice_clean_rx_irq() - first place is for explicit flow director's
descriptor handling, second is after running XDP prog and the last one
is after taking care of skb.

1st callsite was actually only for ntc bump purpose, as Rx buffer to be
recycled is not even passed to a function.

It is possible to walk through Rx buffers processed in particular NAPI
cycle by caching ntc from beginning of the ice_clean_rx_irq().

To do so, let us store XDP verdict inside ice_rx_buf, so action we need
to take on will be known. For XDP prog absence, just store ICE_XDP_PASS
as a verdict.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-7-maciej.fijalkowski@intel.com

1dc1a7e7

ice: Inline eop check · e44f4790

Maciej Fijalkowski authored Jan 31, 2023

This might be in future used by ZC driver and might potentially yield a
minor performance boost. While at it, constify arguments that
ice_is_non_eop() takes, since they are pointers and this will help compiler
while generating asm.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-6-maciej.fijalkowski@intel.com

e44f4790

ice: Pull out next_to_clean bump out of ice_put_rx_buf() · d7956d81

Maciej Fijalkowski authored Jan 31, 2023

Plan is to move ice_put_rx_buf() to the end of ice_clean_rx_irq() so
in order to keep the ability of walking through HW Rx descriptors, pull
out next_to_clean handling out of ice_put_rx_buf().
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-5-maciej.fijalkowski@intel.com

d7956d81

ice: Store page count inside ice_rx_buf · ac075339

Maciej Fijalkowski authored Jan 31, 2023

This will allow us to avoid carrying additional auxiliary array of page
counts when dealing with XDP multi buffer support. Previously combining
fragmented frame to skb was not affected in the same way as XDP would be
as whole frame is needed to be in place before executing XDP prog.
Therefore, when going through HW Rx descriptors one-by-one, calls to
ice_put_rx_buf() need to be taken *after* running XDP prog on a
potentially multi buffered frame, so some additional storage of
page count is needed.

By adding page count to rx buf, it will make it easier to walk through
processed entries at the end of rx cleaning routine and decide whether
or not buffers should be recycled.

While at it, bump ice_rx_buf::pagecnt_bias from u16 up to u32. It was
proven many times that calculations on variables smaller than standard
register size are harmful. This was also the case during experiments
with embedding page count to ice_rx_buf - when this was added as u16 it
had a performance impact.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-4-maciej.fijalkowski@intel.com

ac075339

ice: Add xdp_buff to ice_rx_ring struct · cb0473e0

Maciej Fijalkowski authored Jan 31, 2023

In preparation for XDP multi-buffer support, let's store xdp_buff on
Rx ring struct. This will allow us to combine fragmented frames across
separate NAPI cycles in the same way as currently skb fragments are
handled. This means that skb pointer on Rx ring will become redundant
and will be removed. For now it is kept and layout of Rx ring struct was
not inspected, some member movement will be needed later on so that will
be the time to take care of it.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-3-maciej.fijalkowski@intel.com

cb0473e0

ice: Prepare legacy-rx for upcoming XDP multi-buffer support · c61bcebd

Maciej Fijalkowski authored Jan 31, 2023

Rx path is going to be modified in a way that fragmented frame will be
gathered within xdp_buff in the first place. This approach implies that
underlying buffer has to provide tailroom for skb_shared_info. This is
currently the case when ring uses build_skb but not when legacy-rx knob
is turned on. This case configures 2k Rx buffers and has no way to
provide either headroom or tailroom - FWIW it currently has
XDP_PACKET_HEADROOM which is broken and in here it is removed. 2k Rx
buffers were used so driver in this setting was able to support 9k MTU
as it can chain up to 5 Rx buffers. With offset configuring HW writing
2k of a data was passing the half of the page which broke the assumption
of our internal page recycling tricks.

Now if above got fixed and legacy-rx path would be left as is, when
referring to skb_shared_info via xdp_get_shared_info_from_buff(),
packet's content would be corrupted again. Hence size of Rx buffer needs
to be lowered and therefore supported MTU. This operation will allow us
to keep the unified data path and with 8k MTU users (if any of
legacy-rx) would still be good to go. However, tendency is to drop the
support for this code path at some point.

Add ICE_RXBUF_1664 as vsi::rx_buf_len and ICE_MAX_FRAME_LEGACY_RX (8320)
as vsi::max_frame for legacy-rx. For bigger page sizes configure 3k Rx
buffers, not 2k.

Since headroom support is removed, disable data_meta support on legacy-rx.
When preparing XDP buff, rely on ice_rx_ring::rx_offset setting when
deciding whether to support data_meta or not.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Link: https://lore.kernel.org/bpf/20230131204506.219292-2-maciej.fijalkowski@intel.com

c61bcebd

30 Jan, 2023 7 commits

Merge branch 'Support bpf trampoline for s390x' · c1a3daf7

Alexei Starovoitov authored Jan 29, 2023

Ilya Leoshkevich says:

====================

v2: https://lore.kernel.org/bpf/20230128000650.1516334-1-iii@linux.ibm.com/#t
v2 -> v3:
- Make __arch_prepare_bpf_trampoline static.
  (Reported-by: kernel test robot <lkp@intel.com>)
- Support both old- and new- style map definitions in sk_assign. (Alexei)
- Trim DENYLIST.s390x. (Alexei)
- Adjust s390x vmlinux path in vmtest.sh.
- Drop merged fixes.

v1: https://lore.kernel.org/bpf/20230125213817.1424447-1-iii@linux.ibm.com/#t
v1 -> v2:
- Fix core_read_macros, sk_assign, test_profiler, test_bpffs (24/31;
  I'm not quite happy with the fix, but don't have better ideas),
  and xdp_synproxy. (Andrii)
- Prettify liburandom_read and verify_pkcs7_sig fixes. (Andrii)
- Fix bpf_usdt_arg using barrier_var(); prettify barrier_var(). (Andrii)
- Change BPF_MAX_TRAMP_LINKS to enum and query it using BTF. (Andrii)
- Improve bpf_jit_supports_kfunc_call() description. (Alexei)
- Always check sign_extend() return value.
- Cc: Alexander Gordeev.

Hi,

This series implements poke, trampoline, kfunc, and mixing subprogs
and tailcalls on s390x.

The following failures still remain:

#82      get_stack_raw_tp:FAIL
get_stack_print_output:FAIL:user_stack corrupted user stack
Known issue:
We cannot reliably unwind userspace on s390x without DWARF.

#101     ksyms_module:FAIL
address of kernel function bpf_testmod_test_mod_kfunc is out of range
Known issue:
Kernel and modules are too far away from each other on s390x.

#190     stacktrace_build_id:FAIL
Known issue:
We cannot reliably unwind userspace on s390x without DWARF.

#281     xdp_metadata:FAIL
See patch 6.

None of these seem to be due to the new changes.

Best regards,
Ilya
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

c1a3daf7

selftests/bpf: Trim DENYLIST.s390x · ee105d5a

Ilya Leoshkevich authored Jan 29, 2023

Now that trampoline is implemented, enable a number of tests on s390x.
18 of the remaining failures have to do with either lack of rethook
(fixed by [1]) or syscall symbols missing from BTF (fixed by [2]).

Do not re-classify the remaining failures for now; wait until the
s390/for-next fixes are merged and re-classify only the remaining few.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=for-next&id=1a280f48c0e403903cf0b4231c95b948e664f25a
[2] https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=for-next&id=2213d44e140f979f4b60c3c0f8dd56d151cc8692Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-9-iii@linux.ibm.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

ee105d5a

selftests/bpf: Fix s390x vmlinux path · af320fb7

Ilya Leoshkevich authored Jan 29, 2023

After commit edd4a866 ("s390/boot: get rid of startup archive")
there is no more compressed/ subdirectory.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-8-iii@linux.ibm.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

af320fb7

s390/bpf: Implement bpf_jit_supports_kfunc_call() · 63d7b53a

Ilya Leoshkevich authored Jan 29, 2023

Implement calling kernel functions from eBPF. In general, the eBPF ABI
is fairly close to that of s390x, with one important difference: on
s390x callers should sign-extend signed arguments. Handle that by using
information returned by bpf_jit_find_kfunc_model().

Here is an example of how sign extensions works. Suppose we need to
call the following function from BPF:

    ; long noinline bpf_kfunc_call_test4(signed char a, short b, int c,
long d)
    0000000000936a78 <bpf_kfunc_call_test4>:
    936a78:       c0 04 00 00 00 00       jgnop bpf_kfunc_call_test4
    ;     return (long)a + (long)b + (long)c + d;
    936a7e:       b9 08 00 45             agr     %r4,%r5
    936a82:       b9 08 00 43             agr     %r4,%r3
    936a86:       b9 08 00 24             agr     %r2,%r4
    936a8a:       c0 f4 00 1e 3b 27       jg      <__s390_indirect_jump_r14>

As per the s390x ABI, bpf_kfunc_call_test4() has the right to assume
that a, b and c are sign-extended by the caller, which results in using
64-bit additions (agr) without any additional conversions. Without sign
extension we would have the following on the JITed code side:

    ; tmp = bpf_kfunc_call_test4(-3, -30, -200, -1000);
    ;        5:       b4 10 00 00 ff ff ff fd w1 = -3
    0x3ff7fdcdad4:       llilf   %r2,0xfffffffd
    ;        6:       b4 20 00 00 ff ff ff e2 w2 = -30
    0x3ff7fdcdada:       llilf   %r3,0xffffffe2
    ;        7:       b4 30 00 00 ff ff ff 38 w3 = -200
    0x3ff7fdcdae0:       llilf   %r4,0xffffff38
    ;       8:       b7 40 00 00 ff ff fc 18 r4 = -1000
    0x3ff7fdcdae6:       lgfi    %r5,-1000
    0x3ff7fdcdaec:       mvc     64(4,%r15),160(%r15)
    0x3ff7fdcdaf2:       lgrl    %r1,bpf_kfunc_call_test4@GOT
    0x3ff7fdcdaf8:       brasl   %r14,__s390_indirect_jump_r1

This first 3 llilfs are 32-bit loads, that need to be sign-extended
to 64 bits.

Note: at the moment bpf_jit_find_kfunc_model() does not seem to play
nicely with XDP metadata functions: add_kfunc_call() adds an "abstract"
bpf_*() version to kfunc_btf_tab, but then fixup_kfunc_call() puts the
concrete version into insn->imm, which bpf_jit_find_kfunc_model() cannot
find. But this seems to be a common code problem.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-7-iii@linux.ibm.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

63d7b53a

s390/bpf: Implement bpf_jit_supports_subprog_tailcalls() · dd691e84

Ilya Leoshkevich authored Jan 29, 2023

Allow mixing subprogs and tail calls by passing the current tail
call count to subprogs.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-6-iii@linux.ibm.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

dd691e84

s390/bpf: Implement arch_prepare_bpf_trampoline() · 528eb2cb

Ilya Leoshkevich authored Jan 29, 2023

arch_prepare_bpf_trampoline() is used for direct attachment of eBPF
programs to various places, bypassing kprobes. It's responsible for
calling a number of eBPF programs before, instead and/or after
whatever they are attached to.

Add a s390x implementation, paying attention to the following:

- Reuse the existing JIT infrastructure, where possible.
- Like the existing JIT, prefer making multiple passes instead of
  backpatching. Currently 2 passes is enough. If literal pool is
  introduced, this needs to be raised to 3. However, at the moment
  adding literal pool only makes the code larger. If branch
  shortening is introduced, the number of passes needs to be
  increased even further.
- Support both regular and ftrace calling conventions, depending on
  the trampoline flags.
- Use expolines for indirect calls.
- Handle the mismatch between the eBPF and the s390x ABIs.
- Sign-extend fmod_ret return values.

invoke_bpf_prog() produces about 120 bytes; it might be possible to
slightly optimize this, but reaching 50 bytes, like on x86_64, looks
unrealistic: just loading cookie, __bpf_prog_enter, bpf_func, insnsi
and __bpf_prog_exit as literals already takes at least 5 * 12 = 60
bytes, and we can't use relative addressing for most of them.
Therefore, lower BPF_MAX_TRAMP_LINKS on s390x.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-5-iii@linux.ibm.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

528eb2cb

s390/bpf: Implement bpf_arch_text_poke() · f1d5df84

Ilya Leoshkevich authored Jan 29, 2023

bpf_arch_text_poke() is used to hotpatch eBPF programs and trampolines.
s390x has a very strict hotpatching restriction: the only thing that is
allowed to be hotpatched is conditional branch mask.

Take the same approach as commit de5012b4 ("s390/ftrace: implement
hotpatching"): create a conditional jump to a "plt", which loads the
target address from memory and jumps to it; then first patch this
address, and then the mask.

Trampolines (introduced in the next patch) respect the ftrace calling
convention: the return address is in %r0, and %r1 is clobbered. With
that in mind, bpf_arch_text_poke() does not differentiate between jumps
and calls.

However, there is a simple optimization for jumps (for the epilogue_ip
case): if a jump already points to the destination, then there is no
"plt" and we can just flip the mask.

For simplicity, the "plt" template is defined in assembly, and its size
is used to define C arrays. There doesn't seem to be a way to convey
this size to C as a constant, so it's hardcoded and double-checked
during runtime.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20230129190501.1624747-4-iii@linux.ibm.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

f1d5df84