1. 24 Jan, 2024 23 commits
  2. 23 Jan, 2024 17 commits
    • bpf: Use r constraint instead of p constraint in selftests · bbc094b3
      Jose E. Marchesi authored
      Some of the BPF selftests use the "p" constraint in inline assembly
      snippets, for input operands for MOV (rN = rM) instructions.
      
      This is mainly done via the __imm_ptr macro defined in
      tools/testing/selftests/bpf/progs/bpf_misc.h:
      
        #define __imm_ptr(name) [name]"p"(&name)
      
      Example:
      
        int consume_first_item_only(void *ctx)
        {
              struct bpf_iter_num iter;
              asm volatile (
                      /* create iterator */
                      "r1 = %[iter];"
                      [...]
                      :
                      : __imm_ptr(iter)
                      : CLOBBERS);
              [...]
        }
      
      The "p" constraint is a tricky one.  It is documented in the GCC manual
      section "Simple Constraints":
      
        An operand that is a valid memory address is allowed.  This is for
        ``load address'' and ``push address'' instructions.
      
        p in the constraint must be accompanied by address_operand as the
        predicate in the match_operand.  This predicate interprets the mode
        specified in the match_operand as the mode of the memory reference for
        which the address would be valid.
      
      There are two problems:
      
      1. It is questionable whether that constraint was ever intended to be
         used in inline assembly templates, because its behavior really
         depends on compiler internals.  A "memory address" is not the same
         as a "memory operand" or a "memory reference" (constraint "m"), and
         in fact its usage in the template above results in an error in both
         x86_64-linux-gnu and bpf-unknown-none:
      
           foo.c: In function ‘bar’:
           foo.c:6:3: error: invalid 'asm': invalid expression as operand
              6 |   asm volatile ("r1 = %[jorl]" : : [jorl]"p"(&jorl));
                |   ^~~
      
         I would assume the same happens with aarch64, riscv, and most/all
         other targets in GCC, that do not accept operands of the form A + B
         that are not wrapped either in a const or in a memory reference.
      
         To avoid that error, the usage of the "p" constraint in internal GCC
         instruction templates is supposed to be complemented by the 'a'
         modifier, like in:
      
           asm volatile ("r1 = %a[jorl]" : : [jorl]"p"(&jorl));
      
         Internally documented (in GCC's final.cc) as:
      
           %aN means expect operand N to be a memory address
              (not a memory reference!) and print a reference
              to that address.
      
         That works because when the modifier 'a' is found, GCC prints an
         "operand address", which is not the same as an "operand".
      
         But...
      
      2. Even if we used the internal 'a' modifier (we shouldn't) the 'rN =
         rM' instruction really requires a register argument.  In cases
         involving automatics, like in the examples above, we easily end up with:
      
           bar:
              #APP
                  r1 = r10-4
              #NO_APP
      
         In other cases we could conceivably also end up with a 64-bit label that
         may overflow the 32-bit immediate operand of `rN = imm32'
         instructions:
      
              r1 = foo
      
         All of which is clearly wrong.
      
      clang happens to do "the right thing" in the current usage of __imm_ptr
      in the BPF tests, because even with -O2 it seems to "reload" the
      fp-relative address of the automatic to a register like in:
      
        bar:
      	r1 = r10
      	r1 += -4
      	#APP
      	r1 = r1
      	#NO_APP
      
      Which is what GCC would generate with -O0.  Whether this is by chance
      or by design, the compiler shouldn't be expected to do that reload
      driven by the "p" constraint.
      
      This patch changes the usage of the "p" constraint in the BPF
      selftests macros to use the "r" constraint instead.  If a register is
      what is required, we should let the compiler know.
      
      Previous discussion in bpf@vger:
      https://lore.kernel.org/bpf/87h6p5ebpb.fsf@oracle.com/T/#ef0df83d6975c34dff20bf0dd52e078f5b8ca2767
      
      Tested in bpf-next master.
      No regressions.
      Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240123181309.19853-1-jose.marchesi@oracle.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix constraint in test_tcpbpf_kern.c · 756e34da
      Jose E. Marchesi authored
      GCC emits a warning:
      
        progs/test_tcpbpf_kern.c:60:9: error: ‘op’ is used uninitialized [-Werror=uninitialized]
      
      when an uninitialized op is used with a "+r" constraint.  The + modifier
      means a read-write operand, but that operand in the selftest is just
      written to.
      
      This patch changes the selftest to use a "=r" constraint.  This
      pacifies GCC.
      
      Tested in bpf-next master.
      No regressions.
      Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
      Cc: Yonghong Song <yhs@meta.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: david.faust@oracle.com
      Cc: cupertino.miranda@oracle.com
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240123205624.14746-1-jose.marchesi@oracle.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: avoid VLAs in progs/test_xdp_dynptr.c · edb79903
      Jose E. Marchesi authored
      VLAs are supported by neither the BPF port of clang nor GCC.  The
      selftest test_xdp_dynptr.c contains the following code:
      
        const size_t tcphdr_sz = sizeof(struct tcphdr);
        const size_t udphdr_sz = sizeof(struct udphdr);
        const size_t ethhdr_sz = sizeof(struct ethhdr);
        const size_t iphdr_sz = sizeof(struct iphdr);
        const size_t ipv6hdr_sz = sizeof(struct ipv6hdr);
      
        [...]
      
        static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
        {
      	__u8 eth_buffer[ethhdr_sz + iphdr_sz + ethhdr_sz];
      	__u8 iph_buffer_tcp[iphdr_sz + tcphdr_sz];
      	__u8 iph_buffer_udp[iphdr_sz + udphdr_sz];
      	[...]
        }
      
      The eth_buffer, iph_buffer_tcp and other automatics are fixed size
      only if the compiler optimizes away the constant global variables.
      clang does this, but GCC does not, turning these automatics into
      variable length arrays.
      
      This patch removes the global variables and turns these values into
      preprocessor constants.  This makes the selftest build properly
      with GCC.
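      
      A small sketch of the resulting pattern, using hypothetical stand-in
      structs (demo_eth, demo_ip) rather than the real network headers:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real header structs. */
struct demo_eth { unsigned char bytes[14]; };
struct demo_ip  { unsigned char bytes[20]; };

/* Preprocessor constants: sizeof yields an integer constant expression. */
#define demo_eth_sz sizeof(struct demo_eth)
#define demo_ip_sz  sizeof(struct demo_ip)

static size_t demo_buffer_size(void)
{
	/* A fixed-size array in every compiler, at every optimization level. */
	unsigned char buffer[demo_eth_sz + demo_ip_sz];

	/* Would not compile if buffer were a variable length array. */
	_Static_assert(sizeof(buffer) == 34, "buffer must have a fixed size");

	return sizeof(buffer);
}
```

      With "const size_t" globals instead of the macros, the array bound is
      not a constant expression in C, which is why the original code depended
      on the optimizer.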
      
      Tested in bpf-next master.
      No regressions.
      Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
      Cc: Yonghong Song <yhs@meta.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: david.faust@oracle.com
      Cc: cupertino.miranda@oracle.com
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240123201729.16173-1-jose.marchesi@oracle.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: call dup2() syscall directly · bc308d01
      Andrii Nakryiko authored
      We've run into issues with using the dup2() API in a production setting,
      where libbpf is linked into a large production environment and ends up
      calling unintended custom implementations of dup2(). These custom
      implementations don't provide the atomic FD replacement guarantees of the
      dup2() syscall, leading to subtle and hard-to-debug issues.
      
      To prevent this in the future and guarantee that no libc implementation
      will do their own custom non-atomic dup2() implementation, call dup2()
      syscall directly with syscall(SYS_dup2).
      
      Note that some architectures don't seem to provide dup2 and have dup3
      instead. Try to detect and pick best syscall.
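      
      A minimal sketch of the idea on Linux, assuming the syscall numbers from
      <sys/syscall.h>.  The names raw_dup2 and demo_roundtrip are hypothetical
      and not necessarily what libbpf uses internally:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: bypass any libc-level dup2() override. */
static int raw_dup2(int oldfd, int newfd)
{
#ifdef SYS_dup2
	return (int)syscall(SYS_dup2, oldfd, newfd);
#else
	/*
	 * dup3() with flags == 0 behaves like dup2(), except that it fails
	 * with EINVAL when oldfd == newfd.
	 */
	return (int)syscall(SYS_dup3, oldfd, newfd, 0);
#endif
}

/* Returns 0 if data written through the replaced fd reaches the pipe. */
static int demo_roundtrip(void)
{
	int fds[2], spare, ok;
	char c = 0;

	if (pipe(fds) < 0)
		return -1;
	spare = dup(fds[0]);		/* any valid fd that will be replaced */
	if (spare < 0)
		return -1;
	if (raw_dup2(fds[1], spare) != spare)
		return -1;

	/* spare now refers to the write end of the pipe. */
	ok = write(spare, "x", 1) == 1 &&
	     read(fds[0], &c, 1) == 1 && c == 'x';

	close(fds[0]);
	close(fds[1]);
	close(spare);
	return ok ? 0 : -1;
}
```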
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Song Liu <song@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240119210201.1295511-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'enable-the-inline-of-kptr_xchg-for-arm64' · c80c6434
      Alexei Starovoitov authored
      Hou Tao says:
      
      ====================
      Enable the inline of kptr_xchg for arm64
      
      From: Hou Tao <houtao1@huawei.com>
      
      Hi,
      
      The patch set is just a follow-up for "bpf: inline bpf_kptr_xchg()". It
      enables the inline of bpf_kptr_xchg() and kptr_xchg_inline test for
      arm64.
      
      Please see individual patches for more details. And comments are always
      welcome.
      ====================
      
      Link: https://lore.kernel.org/r/20240119102529.99581-1-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Enable kptr_xchg_inline test for arm64 · 29f86888
      Hou Tao authored
      Now that the arm64 BPF JIT has enabled bpf_jit_supports_ptr_xchg(),
      enable the test for arm64 as well.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20240119102529.99581-3-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, arm64: Enable the inline of bpf_kptr_xchg() · 18a45f12
      Hou Tao authored
      ARM64 bpf jit satisfies the following two conditions:
      1) support BPF_XCHG() on pointer-sized word.
      2) the implementation of xchg is the same as atomic_xchg() on
         pointer-sized words. Both of these two functions use arch_xchg() to
         implement the exchange.
      
      So enable the inline of bpf_kptr_xchg() for arm64 bpf jit.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20240119102529.99581-2-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, docs: Clarify that MOVSX is only for BPF_X not BPF_K · 20e109ea
      Dave Thaler authored
      Per discussion on the mailing list at
      https://mailarchive.ietf.org/arch/msg/bpf/uQiqhURdtxV_ZQOTgjCdm-seh74/
      the MOVSX operation is only defined to support register extension.
      
      The document didn't previously state this and incorrectly implied
      that one could use an immediate value.
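      
      As an illustration of what "register extension" means here, the
      following sketch models a 64-bit MOVSX in C: the low 8, 16 or 32 bits of
      a source register are sign-extended into the destination.  The helper
      name demo_movsx64 is hypothetical, and the narrowing casts rely on the
      two's-complement wrapping that GCC and clang implement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: dst = sign-extended low "bits" bits of src register. */
static int64_t demo_movsx64(uint64_t src, unsigned int bits)
{
	switch (bits) {
	case 8:
		return (int8_t)(uint8_t)src;
	case 16:
		return (int16_t)(uint16_t)src;
	case 32:
		return (int32_t)(uint32_t)src;
	default:
		return (int64_t)src;	/* not a valid width in this sketch */
	}
}
```

      There is no immediate form to model: the source is always a register
      value.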
      Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
      Acked-by: David Vernet <void@manifault.com>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240118232954.27206-1-dthaler1968@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Define struct bpf_tcp_req_attrs when CONFIG_SYN_COOKIES=n. · b3f086a7
      Kuniyuki Iwashima authored
      kernel test robot reported the warning below:
      
        >> net/core/filter.c:11842:13: warning: declaration of 'struct bpf_tcp_req_attrs' will not be visible outside of this function [-Wvisibility]
            11842 |                                         struct bpf_tcp_req_attrs *attrs, int attrs__sz)
                  |                                                ^
           1 warning generated.
      
      struct bpf_tcp_req_attrs is defined under CONFIG_SYN_COOKIES
      but used in kfunc without the config.
      
      Let's move struct bpf_tcp_req_attrs definition outside of
      CONFIG_SYN_COOKIES guard.
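      
      The pattern can be sketched as below.  All names are hypothetical:
      DEMO_SYN_COOKIES stands in for the config option and demo_req_attrs for
      the real struct, and returning -EOPNOTSUPP from the disabled variant is
      an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>

/* Defined unconditionally so every prototype below sees a complete type. */
struct demo_req_attrs {
	unsigned short mss;
	unsigned char wscale_ok;
};

#ifdef DEMO_SYN_COOKIES
static int demo_assign_reqsk(const struct demo_req_attrs *attrs, int attrs__sz)
{
	if (attrs__sz != (int)sizeof(*attrs) || !attrs->mss)
		return -EINVAL;
	return 0;
}
#else
/* Feature compiled out: the parameter type is still visible here. */
static int demo_assign_reqsk(const struct demo_req_attrs *attrs, int attrs__sz)
{
	(void)attrs;
	(void)attrs__sz;
	return -EOPNOTSUPP;
}
#endif
```

      Had the struct definition stayed inside the #ifdef, the second
      prototype would declare a new struct type scoped to its own parameter
      list, which is what the -Wvisibility warning points out.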
      
      Fixes: e472f888 ("bpf: tcp: Support arbitrary SYN Cookie.")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202401180418.CUVc0hxF-lkp@intel.com/
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240118211751.25790-1-kuniyu@amazon.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Refactor ptr alu checking rules to allow alu explicitly · 2ce793eb
      Hao Sun authored
      Current checking rules are structured to disallow alu on particular ptr
      types explicitly, so default cases are allowed implicitly. This may lead
      to newly added ptr types being allowed unexpectedly. So restructure it to
      allow alu explicitly. The tradeoff is mainly a few more cases added in
      the switch. The following table from Eduard summarizes the rules:
      
              | Pointer type        | Arithmetics allowed |
              |---------------------+---------------------|
              | PTR_TO_CTX          | yes                 |
              | CONST_PTR_TO_MAP    | conditionally       |
              | PTR_TO_MAP_VALUE    | yes                 |
              | PTR_TO_MAP_KEY      | yes                 |
              | PTR_TO_STACK        | yes                 |
              | PTR_TO_PACKET_META  | yes                 |
              | PTR_TO_PACKET       | yes                 |
              | PTR_TO_PACKET_END   | no                  |
              | PTR_TO_FLOW_KEYS    | conditionally       |
              | PTR_TO_SOCKET       | no                  |
              | PTR_TO_SOCK_COMMON  | no                  |
              | PTR_TO_TCP_SOCK     | no                  |
              | PTR_TO_TP_BUFFER    | yes                 |
              | PTR_TO_XDP_SOCK     | no                  |
              | PTR_TO_BTF_ID       | yes                 |
              | PTR_TO_MEM          | yes                 |
              | PTR_TO_BUF          | yes                 |
              | PTR_TO_FUNC         | yes                 |
              | CONST_PTR_TO_DYNPTR | yes                 |
      
      The refactored rules are equivalent to the original ones. Note that
      PTR_TO_FUNC and CONST_PTR_TO_DYNPTR are not rejected here because: (1)
      check_mem_access() rejects load/store on those ptrs, and those ptrs
      with offset passing to calls are rejected by check_func_arg_reg_off();
      (2) someone may rely on the verifier not rejecting programs early.
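      
      The shape of an explicit allow-list can be sketched as follows.  The
      enum is a hypothetical, shortened stand-in for the verifier's register
      types, and cond_ok abstracts whatever extra check a "conditionally" row
      in the table requires:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, shortened list of pointer types for illustration only. */
enum demo_ptr_type {
	DEMO_PTR_TO_CTX,
	DEMO_PTR_TO_STACK,
	DEMO_PTR_TO_FLOW_KEYS,
	DEMO_PTR_TO_PACKET_END,
	DEMO_PTR_TO_SOCKET,
	DEMO_PTR_TO_FUTURE_TYPE,	/* stands for a type added later */
};

static bool demo_ptr_alu_allowed(enum demo_ptr_type type, bool cond_ok)
{
	switch (type) {
	case DEMO_PTR_TO_CTX:
	case DEMO_PTR_TO_STACK:
		return true;		/* explicitly allowed */
	case DEMO_PTR_TO_FLOW_KEYS:
		return cond_ok;		/* allowed only conditionally */
	default:
		return false;		/* everything else, including new types */
	}
}
```

      A type added to the enum later falls into the default branch and is
      rejected until someone lists it deliberately.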
      Signed-off-by: Hao Sun <sunhao.th@gmail.com>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20240117094012.36798-1-sunhao.th@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftest/bpf: Add map_in_maps with BPF_MAP_TYPE_PERF_EVENT_ARRAY values · 40628f9f
      Andrey Grafin authored
      Check that bpf_object__load() successfully creates map_in_maps
      with BPF_MAP_TYPE_PERF_EVENT_ARRAY values.
      These changes cover the fix in the previous patch
      "libbpf: Apply map_set_def_max_entries() for inner_maps on creation".
      
      The command line output is:
      - w/o fix
      $ sudo ./test_maps
      libbpf: map 'mim_array_pe': failed to create inner map: -22
      libbpf: map 'mim_array_pe': failed to create: Invalid argument(-22)
      libbpf: failed to load object './test_map_in_map.bpf.o'
      Failed to load test prog
      
      - with fix
      $ sudo ./test_maps
      ...
      test_maps: OK, 0 SKIPPED
      
      Fixes: 646f02ff ("libbpf: Add BTF-defined map-in-map support")
      Signed-off-by: Andrey Grafin <conquistador@yandex-team.ru>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/bpf/20240117130619.9403-2-conquistador@yandex-team.ru
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: Apply map_set_def_max_entries() for inner_maps on creation · f04deb90
      Andrey Grafin authored
      This patch allows bpf_object__load() to auto-create
      BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS maps with
      BPF_MAP_TYPE_PERF_EVENT_ARRAY values.
      
      The previous behaviour created a zero-filled btf_map_def for inner maps
      and tried to use it for map creation, but the Linux kernel forbids creating
      a BPF_MAP_TYPE_PERF_EVENT_ARRAY map with max_entries=0.
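      
      A rough sketch of the defaulting step, under the assumption that the
      default for a perf event array is one entry per CPU.  The types and
      function names are hypothetical, and sysconf() is only a stand-in for
      however the library counts CPUs:

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical, reduced map description for illustration only. */
enum demo_map_type { DEMO_MAP_TYPE_ARRAY, DEMO_MAP_TYPE_PERF_EVENT_ARRAY };

struct demo_map_def {
	enum demo_map_type type;
	unsigned int max_entries;
};

/* Fill in max_entries when a perf event array leaves it at zero. */
static int demo_set_def_max_entries(struct demo_map_def *def)
{
	if (def->type == DEMO_MAP_TYPE_PERF_EVENT_ARRAY && def->max_entries == 0) {
		long nr_cpus = sysconf(_SC_NPROCESSORS_CONF);

		if (nr_cpus < 1)
			return -1;
		def->max_entries = (unsigned int)nr_cpus;
	}
	return 0;
}

/* Convenience: max_entries a zero-filled inner map of this type ends up with. */
static unsigned int demo_default_for(enum demo_map_type type)
{
	struct demo_map_def def = { .type = type, .max_entries = 0 };

	return demo_set_def_max_entries(&def) ? 0 : def.max_entries;
}
```

      The point of the patch is that this step has to run for inner maps as
      well as for outer ones before the create request reaches the kernel.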
      
      Fixes: 646f02ff ("libbpf: Add BTF-defined map-in-map support")
      Signed-off-by: Andrey Grafin <conquistador@yandex-team.ru>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/bpf/20240117130619.9403-1-conquistador@yandex-team.ru
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Sync uapi bpf.h header for the tooling infra · 091f2bf6
      Daniel Borkmann authored
      Both commit 91051f00 ("tcp: Dump bound-only sockets in inet_diag.")
      and commit 985b8ea9ec7e ("bpf, docs: Fix bpf_redirect_peer header doc")
      missed the tooling header sync. Fix it.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, docs: Fix bpf_redirect_peer header doc · f98df79b
      Victor Stewart authored
      Amend the bpf_redirect_peer() header documentation to also mention
      support for the netkit device type.
      Signed-off-by: Victor Stewart <v@nametag.social>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20240116202952.241009-1-v@nametag.social
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf: tcp: Support arbitrary SYN Cookie at TC.' · 4eaafe5a
      Martin KaFai Lau authored
      Kuniyuki Iwashima says:
      
      ====================
      Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
      for the connection request until a valid ACK is returned in response to
      the SYN+ACK.
      
      The cookie contains two kinds of host-specific bits, a timestamp and
      secrets, so it can only be validated by the generator.  It means SYN
      Cookie consumes network resources between the client and the server;
      intermediate nodes must remember which nodes to route ACK for the cookie.
      
      SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
      the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
      backend server and completes another 3WHS.  However, since the server's
      ISN differs from the cookie, the proxy must manage the ISN mappings and
      fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
      node goes down, all the connections through it are terminated.  Keeping
      a state at proxy is painful from that perspective.
      
      At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
      Our SYN Proxy consists of the front proxy layer and the backend kernel
      module.  (See slides of LPC2023 [0], p37 - p48)
      
      The cookie that SYN Proxy generates differs from the kernel's cookie in
      that it contains a secret (called rolling salt) (i) shared by all the proxy
      nodes so that any node can validate ACK and (ii) updated periodically so
      that old cookies cannot be validated and we need not encode a timestamp for
      the cookie.  Also, ISN contains WScale, SACK, and ECN, not in TS val.  This
      is not to sacrifice any connection quality, where some customers turn off
      TCP timestamps option due to retro CVE.
      
      After 3WHS, the proxy restores SYN, encapsulates ACK into SYN, and forwards
      the TCP-in-TCP packet to the backend server.  Our kernel module works at
      Netfilter input/output hooks and first feeds SYN to the TCP stack to
      initiate 3WHS.  When the module is triggered for SYN+ACK, it looks up the
      corresponding request socket and overwrites tcp_rsk(req)->snt_isn with the
      proxy's cookie.  Then, the module can complete 3WHS with the original ACK
      as is.
      
      This way, our SYN Proxy does not manage the ISN mappings nor wait for
      SYN+ACK from the backend thus can remain stateless.  It's working very
      well for high-bandwidth services like multiple Tbps, but we are looking
      for a way to drop the dirty hack and further optimise the sequences.
      
      If we could validate an arbitrary SYN Cookie on the backend server with
      BPF, the proxy would not need to restore SYN or pass it.  After validating
      ACK, the proxy node just needs to forward it, and then the server can do
      the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
      and create a connection from the ACK.
      
      This series allows us to create a full sk from an arbitrary SYN Cookie,
      which is done in 3 steps.
      
        1) At tc, BPF prog calls a new kfunc to create a reqsk and configure
           it based on the argument populated from SYN Cookie.  The reqsk has
           its listener as req->rsk_listener and is passed to the TCP stack as
           skb->sk.
      
        2) During TCP socket lookup for the skb, skb_steal_sock() returns a
           listener in the reuseport group that inet_reqsk(skb->sk)->rsk_listener
           belongs to.
      
        3) In cookie_v[46]_check(), the reqsk (skb->sk) is fully initialised and
           a full sk is created.
      
      The kfunc usage is as follows:
      
          struct bpf_tcp_req_attrs attrs = {
              .mss = mss,
              .wscale_ok = wscale_ok,
              .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
              .snd_wscale = snd_wscale, /* Client's WScale < 15 */
              .tstamp_ok = tstamp_ok,
              .rcv_tsval = tsval,
              .rcv_tsecr = tsecr, /* Server's Initial TSval */
              .usec_ts_ok = usec_ts_ok,
              .sack_ok = sack_ok,
              .ecn_ok = ecn_ok,
          };
      
          skc = bpf_skc_lookup_tcp(...);
          sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
          bpf_sk_assign_tcp_reqsk(skb, sk, &attrs, sizeof(attrs));
          bpf_sk_release(skc);
      
      [0]: https://lpc.events/event/17/contributions/1645/attachments/1350/2701/SYN_Proxy_at_Scale_with_BPF.pdf
      
      Changes:
        v8
          * Rebase on Yonghong's cpuv4 fix
          * Patch 5
            * Fill the trailing 3-bytes padding in struct bpf_tcp_req_attrs
              and test it as null
          * Patch 6
            * Remove unused IPPROTO_MPTCP definition
      
        v7: https://lore.kernel.org/bpf/20231221012806.37137-1-kuniyu@amazon.com/
          * Patch 5 & 6
            * Drop MPTCP support
      
        v6: https://lore.kernel.org/bpf/20231214155424.67136-1-kuniyu@amazon.com/
          * Patch 5 & 6
            * /struct /s/tcp_cookie_attributes/bpf_tcp_req_attrs/
            * Don't reuse struct tcp_options_received and use u8 for each attrs
          * Patch 6
            * Check retval of test__start_subtest()
      
        v5: https://lore.kernel.org/netdev/20231211073650.90819-1-kuniyu@amazon.com/
          * Split patch 1-3
          * Patch 3
            * Clear req->rsk_listener in skb_steal_sock()
          * Patch 4 & 5
            * Move sysctl validation and tsoff init from cookie_bpf_check() to kfunc
          * Patch 5
            * Do not increment LINUX_MIB_SYNCOOKIES(RECV|FAILED)
          * Patch 6
            * Remove __always_inline
            * Test if tcp_handle_{syn,ack}() is executed
            * Move some definition to bpf_tracing_net.h
            * s/BPF_F_CURRENT_NETNS/-1/
      
        v4: https://lore.kernel.org/bpf/20231205013420.88067-1-kuniyu@amazon.com/
          * Patch 1 & 2
            * s/CONFIG_SYN_COOKIE/CONFIG_SYN_COOKIES/
          * Patch 1
            * Don't set rcv_wscale for BPF SYN Cookie case.
          * Patch 2
            * Add test for tcp_opt.{unused,rcv_wscale} in kfunc
            * Modify skb_steal_sock() to avoid resetting skb-sk
            * Support SO_REUSEPORT lookup
          * Patch 3
            * Add CONFIG_SYN_COOKIES to Kconfig for CI
            * Define BPF_F_CURRENT_NETNS
      
        v3: https://lore.kernel.org/netdev/20231121184245.69569-1-kuniyu@amazon.com/
          * Guard kfunc and req->syncookie part in inet6?_steal_sock() with
            CONFIG_SYN_COOKIE
      
        v2: https://lore.kernel.org/netdev/20231120222341.54776-1-kuniyu@amazon.com/
          * Drop SOCK_OPS and move SYN Cookie validation logic to TC with kfunc.
          * Add cleanup patches to reduce discrepancy between cookie_v[46]_check()
      
        v1: https://lore.kernel.org/bpf/20231013220433.70792-1-kuniyu@amazon.com/
      ====================
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftest: bpf: Test bpf_sk_assign_tcp_reqsk(). · a7471224
      Kuniyuki Iwashima authored
      This commit adds a sample selftest to demonstrate how we can use
      bpf_sk_assign_tcp_reqsk() as the backend of SYN Proxy.
      
      The test creates IPv4/IPv6 x TCP connections and transfers messages
      over them on lo with a BPF tc prog attached.
      
      The tc prog processes SYN and returns SYN+ACK with the following
      ISN and TS.  In a real use case, this part will be done by other
      hosts.
      
              MSB                                   LSB
        ISN:  | 31 ... 8 | 7 6 |   5 |    4 | 3 2 1 0 |
              |   Hash_1 | MSS | ECN | SACK |  WScale |
      
        TS:   | 31 ... 8 |          7 ... 0           |
              |   Random |           Hash_2           |
      
        WScale in SYN is reused in SYN+ACK.
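      
      The ISN layout above can be packed and unpacked as sketched below.  The
      function names are hypothetical, and treating the 2-bit MSS field as an
      index into some table of MSS values is an assumption; the bit positions
      come from the diagram.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the fields into an ISN: Hash_1[31:8] MSS[7:6] ECN[5] SACK[4] WScale[3:0]. */
static uint32_t demo_pack_isn(uint32_t hash_1, unsigned int mss_idx,
			      unsigned int ecn, unsigned int sack,
			      unsigned int wscale)
{
	return (hash_1 << 8) |
	       ((mss_idx & 0x3u) << 6) |
	       ((ecn & 0x1u) << 5) |
	       ((sack & 0x1u) << 4) |
	       (wscale & 0xfu);
}

/* Field extractors, mirroring the layout bit for bit. */
static uint32_t demo_isn_hash(uint32_t isn)       { return isn >> 8; }
static unsigned int demo_isn_mss(uint32_t isn)    { return (isn >> 6) & 0x3u; }
static unsigned int demo_isn_ecn(uint32_t isn)    { return (isn >> 5) & 0x1u; }
static unsigned int demo_isn_sack(uint32_t isn)   { return (isn >> 4) & 0x1u; }
static unsigned int demo_isn_wscale(uint32_t isn) { return isn & 0xfu; }
```

      On the ACK path the prog would recompute the hash from the packet and
      compare it with demo_isn_hash() before trusting the option bits.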
      
      The client returns ACK, and tc prog will recalculate ISN and TS
      from ACK and validate SYN Cookie.
      
      If it's valid, the prog calls kfunc to allocate a reqsk for skb and
      configure the reqsk based on the argument created from SYN Cookie.
      
      Later, the reqsk will be processed in cookie_v[46]_check() to create
      a connection.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240115205514.68364-7-kuniyu@amazon.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: tcp: Support arbitrary SYN Cookie. · e472f888
      Kuniyuki Iwashima authored
      This patch adds a new kfunc available at TC hook to support arbitrary
      SYN Cookie.
      
      The basic usage is as follows:
      
          struct bpf_tcp_req_attrs attrs = {
              .mss = mss,
              .wscale_ok = wscale_ok,
              .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
              .snd_wscale = snd_wscale, /* Client's WScale < 15 */
              .tstamp_ok = tstamp_ok,
              .rcv_tsval = tsval,
              .rcv_tsecr = tsecr, /* Server's Initial TSval */
              .usec_ts_ok = usec_ts_ok,
              .sack_ok = sack_ok,
              .ecn_ok = ecn_ok,
          };
      
          skc = bpf_skc_lookup_tcp(...);
          sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
          bpf_sk_assign_tcp_reqsk(skb, sk, &attrs, sizeof(attrs));
          bpf_sk_release(skc);
      
      bpf_sk_assign_tcp_reqsk() takes skb, a listener sk, and struct
      bpf_tcp_req_attrs and allocates reqsk and configures it.  Then,
      bpf_sk_assign_tcp_reqsk() links reqsk with skb and the listener.
      
      The notable thing here is that we do not hold a refcnt for either the
      reqsk or the listener.  To differentiate that, we mark reqsk->syncookie,
      which is only used in TX for now.  So, if reqsk->syncookie is 1 in RX, it
      means that the reqsk is allocated by kfunc.
      
      When skb is freed, sock_pfree() checks if reqsk->syncookie is 1,
      and in that case, we set NULL to reqsk->rsk_listener before calling
      reqsk_free() as reqsk does not hold a refcnt of the listener.
      
      When the TCP stack looks up a socket from the skb, we steal the
      listener from the reqsk in skb_steal_sock() and create a full sk
      in cookie_v[46]_check().
      
      The refcnt of reqsk will finally be set to 1 in tcp_get_cookie_sock()
      after creating a full sk.
      
      Note that we can extend struct bpf_tcp_req_attrs in the future when
      we add a new attribute that is determined in 3WHS.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240115205514.68364-6-kuniyu@amazon.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>