- 03 Jul, 2019 14 commits
-
-
Daniel Borkmann authored
Stanislav Fomichev says: ==================== Congestion control team would like to have a periodic callback to track some TCP statistics. Let's add a sock_ops callback that can be selectively enabled on a socket by socket basis and is executed for every RTT. BPF program frequency can be further controlled by calling bpf_ktime_get_ns and bailing out early. I run neper tcp_stream and tcp_rr tests with the sample program from the last patch and didn't observe any noticeable performance difference. v2: * add a comment about second accept() in selftest (Yonghong Song) * refer to tcp_bpf.readme in sample program (Yonghong Song) ==================== Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
Copy-paste, should be detach, not attach. Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
Uses new RTT callback to dump stats every second. $ mkdir -p /tmp/cgroupv2 $ mount -t cgroup2 none /tmp/cgroupv2 $ mkdir -p /tmp/cgroupv2/foo $ echo $$ >> /tmp/cgroupv2/foo/cgroup.procs $ bpftool prog load ./tcp_dumpstats_kern.o /sys/fs/bpf/tcp_prog $ bpftool cgroup attach /tmp/cgroupv2/foo sock_ops pinned /sys/fs/bpf/tcp_prog $ bpftool prog tracelog $ # run neper/netperf/etc Used neper to compare performance with and without this program attached and didn't see any noticeable performance impact. Sample output: <idle>-0 [015] ..s. 2074.128800: 0: dsack_dups=0 delivered=242526 <idle>-0 [015] ..s. 2074.128808: 0: delivered_ce=0 icsk_retransmits=0 <idle>-0 [015] ..s. 2075.130133: 0: dsack_dups=0 delivered=323599 <idle>-0 [015] ..s. 2075.130138: 0: delivered_ce=0 icsk_retransmits=0 <idle>-0 [005] .Ns. 2076.131440: 0: dsack_dups=0 delivered=404648 <idle>-0 [005] .Ns. 2076.131447: 0: delivered_ce=0 icsk_retransmits=0 Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
Make sure the callback is invoked for syn-ack and data packet. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
Sync new bpf_tcp_sock fields and new BPF_PROG_TYPE_SOCK_OPS RTT callback. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
Add some inet_connection_sock fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
Add more fields to bpf_tcp_sock that might be useful for debugging congestion control issues. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
We've added bpf_tcp_sock member to bpf_sock_ops and don't expect any new tcp_sock fields in bpf_sock_ops. Let's remove CONVERT_COMMON_TCP_SOCK_FIELDS so bpf_tcp_sock can be independently extended. Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
Performance impact should be minimal because it's under a new BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled. Suggested-by: Eric Dumazet <edumazet@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jiri Benc authored
The progs for bpf selftests use several different notations to force function inlining. Standardize to what most of them use, static __always_inline. Suggested-by: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
brakmo authored
Adds support for fq's Earliest Departure Time to HBM (Host Bandwidth Manager). Includes a new BPF program supporting EDT, and also updates corresponding programs. It will drop packets with an EDT of more than 500us in the future unless the packet belongs to a flow with less than 2 packets in flight. This is done so each flow has at least 2 packets in flight, so they will not starve, and also to help prevent delayed ACK timeouts. It will also work with ECN enabled traffic, where the packets will be CE marked if their EDT is more than 50us in the future. The table below shows some performance numbers. The flows are back to back RPCS. One server sending to another, either 2 or 4 flows. One flow is a 10KB RPC, the rest are 1MB RPCs. When there are more than one flow of a given RPC size, the numbers represent averages. The rate limit applies to all flows (they are in the same cgroup). Tests ending with "-edt" ran with the new BPF program supporting EDT. Tests ending with "-hbt" ran on top HBT qdisc with the specified rate (i.e. no HBM). The other tests ran with the HBM BPF program included in the HBM patch-set. EDT has limited value when using DCTCP, but it helps in many cases when using Cubic. It usually achieves larger link utilization and lower 99% latencies for the 1MB RPCs. HBM ends up queueing a lot of packets with its default parameter values, reducing the goodput of the 10KB RPCs and increasing their latency. Also, the RTTs seen by the flows are quite large. Aggr 10K 10K 10K 1MB 1MB 1MB Limit rate drops RTT rate P90 P99 rate P90 P99 Test rate Flows Mbps % us Mbps us us Mbps ms ms -------- ---- ----- ---- ----- --- ---- ---- ---- ---- ---- ---- cubic 1G 2 904 0.02 108 257 511 539 647 13.4 24.5 cubic-edt 1G 2 982 0.01 156 239 656 967 743 14.0 17.2 dctcp 1G 2 977 0.00 105 324 408 744 653 14.5 15.9 dctcp-edt 1G 2 981 0.01 142 321 417 811 660 15.7 17.0 cubic-htb 1G 2 919 0.00 1825 40 2822 4140 879 9.7 9.9 cubic 200M 2 155 0.30 220 81 532 655 74 283 450 cubic-edt 200M 2 188 0.02 222 87 1035 1095 101 84 85 dctcp 200M 2 188 0.03 111 77 912 939 111 76 325 dctcp-edt 200M 2 188 0.03 217 74 1416 1738 114 76 79 cubic-htb 200M 2 188 0.00 5015 8 14ms 15ms 180 48 50 cubic 1G 4 952 0.03 110 165 516 546 262 38 154 cubic-edt 1G 4 973 0.01 190 111 1034 1314 287 65 79 dctcp 1G 4 951 0.00 103 180 617 905 257 37 38 dctcp-edt 1G 4 967 0.00 163 151 732 1126 272 43 55 cubic-htb 1G 4 914 0.00 3249 13 7ms 8ms 300 29 34 cubic 5G 4 4236 0.00 134 305 490 624 1310 10 17 cubic-edt 5G 4 4865 0.00 156 306 425 759 1520 10 16 dctcp 5G 4 4936 0.00 128 485 221 409 1484 7 9 dctcp-edt 5G 4 4924 0.00 148 390 392 623 1508 11 26 v1 -> v2: Incorporated Andrii's suggestions v2 -> v3: Incorporated Yonghong's suggestions v3 -> v4: Removed credit update that is not needed Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Leo Yan authored
Based on the following report from Smatch, fix the potential NULL pointer dereference check: tools/lib/bpf/libbpf.c:3493 bpf_prog_load_xattr() warn: variable dereferenced before check 'attr' (see line 3483) 3479 int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr, 3480 struct bpf_object **pobj, int *prog_fd) 3481 { 3482 struct bpf_object_open_attr open_attr = { 3483 .file = attr->file, 3484 .prog_type = attr->prog_type, ^^^^^^ 3485 }; At the head of function, it directly access 'attr' without checking if it's NULL pointer. This patch moves the values assignment after validating 'attr' and 'attr->file'. Signed-off-by: Leo Yan <leo.yan@linaro.org> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Andrii Nakryiko authored
GCC8 started emitting warning about using strncpy with number of bytes exactly equal destination size, which is generally unsafe, as can lead to non-zero terminated string being copied. Use IFNAMSIZ - 1 as number of bytes to ensure name is always zero-terminated. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Alexei Starovoitov authored
When equivalent state is found the current state needs to propagate precision marks. Otherwise the verifier will prune the search incorrectly. There is a price for correctness: before before broken fixed cnst spill precise precise bpf_lb-DLB_L3.o 1923 8128 1863 1898 bpf_lb-DLB_L4.o 3077 6707 2468 2666 bpf_lb-DUNKNOWN.o 1062 1062 544 544 bpf_lxc-DDROP_ALL.o 166729 380712 22629 36823 bpf_lxc-DUNKNOWN.o 174607 440652 28805 45325 bpf_netdev.o 8407 31904 6801 7002 bpf_overlay.o 5420 23569 4754 4858 bpf_lxc_jit.o 39389 359445 50925 69631 Overall precision tracking is still very effective. Fixes: b5dc0163 ("bpf: precise scalar_value tracking") Reported-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Tested-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 28 Jun, 2019 7 commits
-
-
Daniel Borkmann authored
Toke Høiland-Jørgensen says: ==================== When using the bpf_redirect_map() helper to redirect packets from XDP, the eBPF program cannot currently know whether the redirect will succeed, which makes it impossible to gracefully handle errors. To properly fix this will probably require deeper changes to the way TX resources are allocated, but one thing that is fairly straight forward to fix is to allow lookups into devmaps, so programs can at least know when a redirect is *guaranteed* to fail because there is no entry in the map. Currently, programs work around this by keeping a shadow map of another type which indicates whether a map index is valid. This series contains two changes that are complementary ways to fix this issue: - Moving the map lookup into the bpf_redirect_map() helper (and caching the result), so the helper can return an error if no value is found in the map. This includes a refactoring of the devmap and cpumap code to not care about the index on enqueue. - Allowing regular lookups into devmaps from eBPF programs, using the read-only flag to make sure they don't change the values. The performance impact of the series is negligible, in the sense that I cannot measure it because the variance between test runs is higher than the difference pre/post series. Changelog: v6: - Factor out list handling in maps to a helper in list.h (new patch 1) - Rename variables in struct bpf_redirect_info (new patch 3 + patch 4) - Explain why we are clearing out the map in the info struct on lookup failure - Remove unneeded check for forwarding target in tracepoint macro v5: - Rebase on latest bpf-next. - Update documentation for bpf_redirect_map() with the new meaning of flags. v4: - Fix a few nits from Andrii - Lose the #defines in bpf.h and just compare the flags argument directly to XDP_TX in bpf_xdp_redirect_map(). v3: - Adopt Jonathan's idea of using the lower two bits of the flag value as the return code. - Always do the lookup, and cache the result for use in xdp_do_redirect(); to achieve this, refactor the devmap and cpumap code to get rid the bitmap for selecting which devices to flush. v2: - For patch 1, make it clear that the change works for any map type. - For patch 2, just use the new BPF_F_RDONLY_PROG flag to make the return value read-only. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Toke Høiland-Jørgensen authored
We don't currently allow lookups into a devmap from eBPF, because the map lookup returns a pointer directly to the dev->ifindex, which shouldn't be modifiable from eBPF. However, being able to do lookups in devmaps is useful to know (e.g.) whether forwarding to a specific interface is enabled. Currently, programs work around this by keeping a shadow map of another type which indicates whether a map index is valid. Since we now have a flag to make maps read-only from the eBPF side, we can simply lift the lookup restriction if we make sure this flag is always set. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Toke Høiland-Jørgensen authored
The bpf_redirect_map() helper used by XDP programs doesn't return any indication of whether it can successfully redirect to the map index it was given. Instead, BPF programs have to track this themselves, leading to programs using duplicate maps to track which entries are populated in the devmap. This patch fixes this by moving the map lookup into the bpf_redirect_map() helper, which makes it possible to return failure to the eBPF program. The lower bits of the flags argument is used as the return code, which means that existing users who pass a '0' flag argument will get XDP_ABORTED. With this, a BPF program can check the return code from the helper call and react by, for instance, substituting a different redirect. This works for any type of map used for redirect. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Toke Høiland-Jørgensen authored
The bpf_redirect_info struct has an 'ifindex' member which was named back when the redirects could only target egress interfaces. Now that we can also redirect to sockets and CPUs, this is a bit misleading, so rename the member to tgt_index. Reorder the struct members so we can have 'tgt_index' and 'tgt_value' next to each other in a subsequent patch. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Toke Høiland-Jørgensen authored
The socket map uses a linked list instead of a bitmap to keep track of which entries to flush. Do the same for devmap and cpumap, as this means we don't have to care about the map index when enqueueing things into the map (and so we can cache the map lookup). Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Toke Høiland-Jørgensen authored
Add a helper in list.h for the non-standard way of clearing a list that is used in xskmap. This makes it easier to reuse it in the other map types, and also makes sure this usage is not forgotten in any list refactorings in the future. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Stanislav Fomichev authored
Let's use union with u8[4] and u32 members for sockopt buffer, that should fix any possible aliasing issues. test_sockopt_sk.c: In function ‘getsetsockopt’: test_sockopt_sk.c:115:2: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] if (*(__u32 *)buf != 0x55AA*2) { ^~ test_sockopt_sk.c:116:3: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] log_err("Unexpected getsockopt(SO_SNDBUF) 0x%x != 0x55AA*2", ^~~~~~~ Fixes: 8a027dc0 ("selftests/bpf: add sockopt test that exercises sk helpers") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 27 Jun, 2019 19 commits
-
-
Alexei Starovoitov authored
Stanislav Fomichev says: ==================== This series implements two new per-cgroup hooks: getsockopt and setsockopt along with a new sockopt program type. The idea is pretty similar to recently introduced cgroup sysctl hooks, but implementation is simpler (no need to convert to/from strings). What this can be applied to: * move business logic of what tos/priority/etc can be set by containers (either pass or reject) * handle existing options (or introduce new ones) differently by propagating some information in cgroup/socket local storage Compared to a simple syscall/{g,s}etsockopt tracepoint, those hooks are context aware. Meaning, they can access underlying socket and use cgroup and socket local storage. v9: * allow overwriting setsocktop arguments (Alexei Starovoitov) (see individual changes for more changelog details) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
Support sockopt prog type and cgroup hooks in the bpftool. Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
Provide user documentation about sockopt prog type and cgroup hooks. v9: * add details about setsockopt context and inheritance v7: * add description for retval=0 and optlen=-1 v6: * describe cgroup chaining, add example v2: * use return code 2 for kernel bypass Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
sockopt test that verifies chaining behavior. v9: * setsockopt chaining example v7: * rework the test to verify cgroup getsockopt chaining Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
socktop test that introduces new SOL_CUSTOM sockopt level and stores whatever users sets in sk storage. Whenever getsockopt is called, the original value is retrieved. v9: * SO_SNDBUF example to override user-supplied buffer v7: * use retval=0 and optlen-1 v6: * test 'ret=1' use-case as well (Alexei Starovoitov) v4: * don't call bpf_sk_fullsock helper v3: * drop (__u8 *)(long) casts for optval{,_end} v2: * new test Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
Add sockopt selftests: * require proper expected_attach_type * enforce context field read/write access * test bpf_sockopt_handled handler * test EPERM * test limiting optlen from getsockopt * test out-of-bounds access v9: * add tests for setsockopt argument mangling v7: * remove return 2; test retval=0 and optlen=-1 v3: * use DW for optval{,_end} loads v2: * use return code 2 for kernel bypass Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
Add tests that make sure libbpf section detection works. Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
Make libbpf aware of new sockopt hooks so it can derive prog type and hook point from the section names. Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
Export new prog type and hook points to the libbpf. Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks. BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before passing them down to the kernel or bypass kernel completely. BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that kernel returns. Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure. The buffer memory is pre-allocated (because I don't think there is a precedent for working with __user memory from bpf). This might be slow to do for each {s,g}etsockopt call, that's why I've added __cgroup_bpf_prog_array_is_empty that exits early if there is nothing attached to a cgroup. Note, however, that there is a race between __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup program layout might have changed; this should not be a problem because in general there is a race between multiple calls to {s,g}etsocktop and user adding/removing bpf progs from a cgroup. The return code of the BPF program is handled as follows: * 0: EPERM * 1: success, continue with next BPF program in the cgroup chain v9: * allow overwriting setsockopt arguments (Alexei Starovoitov): * use set_fs (same as kernel_setsockopt) * buffer is always kzalloc'd (no small on-stack buffer) v8: * use s32 for optlen (Andrii Nakryiko) v7: * return only 0 or 1 (Alexei Starovoitov) * always run all progs (Alexei Starovoitov) * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov) (decided to use optval=-1 instead, optval=0 might be a valid input) * call getsockopt hook after kernel handlers (Alexei Starovoitov) v6: * rework cgroup chaining; stop as soon as bpf program returns 0 or 2; see patch with the documentation for the details * drop Andrii's and Martin's Acked-by (not sure they are comfortable with the new state of things) v5: * skip copy_to_user() and put_user() when ret == 0 (Martin Lau) v4: * don't export bpf_sk_fullsock helper (Martin Lau) * size != sizeof(__u64) for uapi pointers (Martin Lau) * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau) v3: * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko) * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii Nakryiko) * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau) * use BPF_FIELD_SIZEOF() for consistency (Martin Lau) * new CG_SOCKOPT_ACCESS macro to wrap repeated parts v2: * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau) * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau) * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau) * added [0,2] return code check to verifier (Martin Lau) * dropped unused buf[64] from the stack (Martin Lau) * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau) * dropped bpf_target_off from ctx rewrites (Martin Lau) * use return code for kernel bypass (Martin Lau & Andrii Nakryiko) Cc: Andrii Nakryiko <andriin@fb.com> Cc: Martin Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Borkmann authored
Tariq Toukan says: ==================== This series contains improvements to the AF_XDP kernel infrastructure and AF_XDP support in mlx5e. The infrastructure improvements are required for mlx5e, but also some of them benefit to all drivers, and some can be useful for other drivers that want to implement AF_XDP. The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled, single stream: txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU) rxdrop: 12.2 Mpps l2fwd: 9.4 Mpps The results with retpoline enabled, single stream: txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU) rxdrop: 9.9 Mpps l2fwd: 6.8 Mpps v2 changes: Added patches for mlx5e and addressed the comments for v1. Rebased for bpf-next. v3 changes: Rebased for the newer bpf-next, resolved conflicts in libbpf. Addressed Björn's comments for coding style. Fixed a bug in error handling flow in mlx5e_open_xsk. v4 changes: UAPI is not changed, XSK RX queues are exposed to the kernel. The lower half of the available amount of RX queues are regular queues, and the upper half are XSK RX queues. The patch "xsk: Extend channels to support combined XSK/non-XSK traffic" was dropped. The final patch was reworked accordingly. Added "net/mlx5e: Attach/detach XDP program safely", as the changes introduced in the XSK patch base on the stuff from this one. Added "libbpf: Support drivers with non-combined channels", which aligns the condition in libbpf with the condition in the kernel. Rebased over the newer bpf-next. v5 changes: In v4, ethtool reports the number of channels as 'combined' and the number of XSK RX queues as 'rx' for mlx5e. It was changed, so that 'rx' is 0, and 'combined' reports the double amount of channels if there is an active UMEM - to make libbpf happy. The patch for libbpf was dropped. Although it's still useful and fixes things, it raises some disagreement, so I'm dropping it - it's no longer useful for mlx5e anymore after the change above. v6 changes: As Maxim is out of office, I rebased the series on behalf of him, solved some conflicts, and re-spinned. ==================== Acked-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Maxim Mikityanskiy authored
This commit adds support for AF_XDP zero-copy RX and TX. We create a dedicated XSK RQ inside the channel, it means that two RQs are running simultaneously: one for non-XSK traffic and the other for XSK traffic. The regular and XSK RQs use a single ID namespace split into two halves: the lower half is regular RQs, and the upper half is XSK RQs. When any zero-copy AF_XDP socket is active, changing the number of channels is not allowed, because it would break to mapping between XSK RQ IDs and channels. XSK requires different page allocation and release routines. Such functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are generic enough to be used for both regular and XSK RQs, and they use the mlx5e_page_{alloc,release} wrappers around the real allocation functions. Function pointers are not used to avoid losing the performance with retpolines. Wherever it's certain that the regular (non-XSK) page release function should be used, it's called directly. Only the stats that could be meaningful for XSK are exposed to the userspace. Those that don't take part in the XSK flow are not considered. Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ), because the newer xdpsock sample doesn't provide any Fill Ring entries at the setup stage. We create a dedicated XSK SQ in the channel. This separation has its advantages: 1. When the UMEM is closed, the XSK SQ can also be closed and stop receiving completions. If an existing SQ was used for XSK, it would continue receiving completions for the packets of the closed socket. If a new UMEM was opened at that point, it would start getting completions that don't belong to it. 2. Calculating statistics separately. When the userspace kicks the TX, the driver triggers a hardware interrupt by posting a NOP to a dedicated XSK ICO (internal control operations) SQ, in order to trigger NAPI on the right CPU core. This XSK ICO SQ is protected by a spinlock, as the userspace application may kick the TX from any core. Store the pointers to the UMEMs in the net device private context, independently from the kernel. This way the driver can distinguish between the zero-copy and non-zero-copy UMEMs. The kernel function xdp_get_umem_from_qid does not care about this difference, but the driver is only interested in zero-copy UMEMs, particularly, on the cleanup it determines whether to close the XSK RQ and SQ or not by looking at the presence of the UMEM. Use state_lock to protect the access to this area of UMEM pointers. LRO isn't compatible with XDP, but there may be active UMEMs while XDP is off. If this is the case, don't allow LRO to ensure XDP can be reenabled at any time. The validation of XSK parameters typically happens when XSK queues open. However, when the interface is down or the XDP program isn't set, it's still possible to have active AF_XDP sockets and even to open new, but the XSK queues will be closed. To cover these cases, perform the validation also in these flows: 1. A new UMEM is registered, but the XSK queues aren't going to be created due to missing XDP program or interface being down. 2. MTU changes while there are UMEMs registered. Having this early check prevents mlx5e_open_channels from failing at a later stage, where recovery is impossible and the application has no chance to handle the error, because it got the successful return value for an MTU change or XSK open operation. The performance testing was performed on a machine with the following configuration: - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz - Mellanox ConnectX-5 Ex with 100 Gbit/s link The results with retpoline disabled, single stream: txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU) rxdrop: 12.2 Mpps l2fwd: 9.4 Mpps The results with retpoline enabled, single stream: txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU) rxdrop: 9.9 Mpps l2fwd: 6.8 Mpps Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Maxim Mikityanskiy authored
structs mlx5e_{rq,sq,cq,channel}_param are going to be used in the upcoming XSK RX and TX patches. Move them to a header file to make them accessible from other C files. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Maxim Mikityanskiy authored
Create new functions mlx5e_{open,close}_queues to encapsulate opening and closing RQs and SQs, and call the new functions from mlx5e_{open,close}_channel. It simplifies the existing functions a bit and prepares them for the upcoming AF_XDP changes. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Maxim Mikityanskiy authored
Use the existing mlx5e_get_linear_rq_headroom function to calculate the headroom for mlx5e_xdp_max_mtu. This function takes the XSK headroom into consideration, which will be used in the following patches. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Maxim Mikityanskiy authored
When an XDP program returns XDP_TX, and the RQ is XSK-enabled, it requires careful handling, because convert_to_xdp_frame creates a new page and copies the data there, while our driver expects the xdp_frame to point to the same memory as the xdp_buff. Handle this case separately: map the page, and in the end unmap it and call xdp_return_frame. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Maxim Mikityanskiy authored
Put the XDP SQ that is used for XDP_TX into the channel. It used to be a part of the RQ, but with introduction of AF_XDP there will be one more RQ that could share the same XDP SQ. This patch is a preparation for that change. Separate XDP_TX statistics per RQ were implemented in one of the previous patches. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Maxim Mikityanskiy authored
Currently, struct mlx5e_xdp_info has some issues that have to be cleaned up before the upcoming AF_XDP support makes things too complicated and messy. This structure is used both when sending the packet and on completion. Moreover, the cleanup procedure on completion depends on the origin of the packet (XDP_REDIRECT, XDP_TX). Adding AF_XDP support will add new flows that use this structure even differently. To avoid overcomplicating the code, this commit refactors the usage of this structure in the following ways: 1. struct mlx5e_xdp_info is split into two different structures. One is struct mlx5e_xdp_xmit_data, a transient structure that doesn't need to be stored and is only used while sending the packet. The other is still struct mlx5e_xdp_info that is stored in a FIFO and contains the fields needed on completion. 2. The fields of struct mlx5e_xdp_info that are used in different flows are put into a union. A special enum indicates the cleanup mode and helps choose the right union member. This approach is clear and explicit. Although it could be possible to "guess" the mode by looking at the values of the fields and at the XDP SQ type, it wouldn't be that clear and extendable and would require looking through the whole chain to understand what's going on. For the reference, there are the fields of struct mlx5e_xdp_info that are used in different flows (including AF_XDP ones): Packet origin | Fields used on completion | Cleanup steps -----------------------+---------------------------+------------------ XDP_REDIRECT, | xdpf, dma_addr | DMA unmap and XDP_TX from XSK RQ | | xdp_return_frame. -----------------------+---------------------------+------------------ XDP_TX from regular RQ | di | Recycle page. -----------------------+---------------------------+------------------ AF_XDP TX | (none) | Increment the | | producer index in | | Completion Ring. On send, the same set of mlx5e_xdp_xmit_data fields is used in all flows: DMA and virtual addresses and length. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Maxim Mikityanskiy authored
Prepare to creation of the XSK RQ, which will require posting UMRs, too. The same ICO SQ will be used for both RQs and also to trigger interrupts by posting NOPs. UMR WQEs can't be reused any more. Optimization introduced in commit ab966d7e ("net/mlx5e: RX, Recycle buffer of UMR WQEs") is reverted. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-