1. 25 Jan, 2021 5 commits
  2. 22 Jan, 2021 3 commits
  3. 21 Jan, 2021 4 commits
  4. 20 Jan, 2021 28 commits
    • Merge branch 'bpf: misc performance improvements for cgroup' · 636d549f
      Alexei Starovoitov authored
      Stanislav Fomichev says:
      
      ====================
      
      First patch adds custom getsockopt for TCP_ZEROCOPY_RECEIVE
      to remove kmalloc and lock_sock overhead from the data path.
      
      Second patch removes kzalloc/kfree from getsockopt for the common cases.
      
      Third patch switches cgroup_bpf_enabled to be per-attach to
      add overhead only for the cgroup attach types used on the system.
      
      No visible user-side changes.
      
      v9:
      - include linux/tcp.h instead of netinet/tcp.h in sockopt_sk.c
      - note that v9 depends on the commit 4be34f3d ("bpf: Don't leak
        memory in bpf getsockopt when optlen == 0") from bpf tree
      
      v8:
      - add bpf.h to tools/include/uapi in the same patch (Martin KaFai Lau)
      - kmalloc instead of kzalloc when exporting buffer (Martin KaFai Lau)
      - note that v8 depends on the commit 4be34f3d ("bpf: Don't leak
        memory in bpf getsockopt when optlen == 0") from bpf tree
      
      v7:
      - add comment about buffer contents for retval != 0 (Martin KaFai Lau)
      - export tcp.h into tools/include/uapi (Martin KaFai Lau)
      - note that v7 depends on the commit 4be34f3d ("bpf: Don't leak
        memory in bpf getsockopt when optlen == 0") from bpf tree
      
      v6:
      - avoid indirect cost for new bpf_bypass_getsockopt (Eric Dumazet)
      
      v5:
      - reorder patches to reduce the churn (Martin KaFai Lau)
      
      v4:
      - update performance numbers
      - bypass_bpf_getsockopt (Martin KaFai Lau)
      
      v3:
      - remove extra newline, add comment about sizeof tcp_zerocopy_receive
        (Martin KaFai Lau)
      - add another patch to remove lock_sock overhead from
        TCP_ZEROCOPY_RECEIVE; technically, this makes patch #1 obsolete,
        but I'd still prefer to keep it to help with other socket
        options
      
      v2:
      - perf numbers for getsockopt kmalloc reduction (Song Liu)
      - (sk) in BPF_CGROUP_PRE_CONNECT_ENABLED (Song Liu)
      - 128 -> 64 buffer size, BUILD_BUG_ON (Martin KaFai Lau)
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      636d549f
    • bpf: Split cgroup_bpf_enabled per attach type · a9ed15da
      Stanislav Fomichev authored
      When we attach any cgroup hook, the rest (even if unused/unattached) start
      to contribute small overhead. In particular, the one we want to avoid is
      __cgroup_bpf_run_filter_skb which does two redirections to get to
      the cgroup and pushes/pulls skb.
      
      Let's split cgroup_bpf_enabled to be per-attach to make sure
      only used attach types trigger.
      
      I've dropped some existing high-level cgroup_bpf_enabled in some
      places because BPF_PROG_CGROUP_XXX_RUN macros usually have another
      cgroup_bpf_enabled check.
      
      I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for
      GETPEERNAME/GETSOCKNAME because type for cgroup_bpf_enabled[type]
      has to be constant and known at compile time.
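      The per-attach-type split can be illustrated with a small user-space
      model. This is only a sketch: the kernel uses static keys rather than
      plain counters, and the enum values and function names below are
      hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of attach types; the real list is much longer. */
enum attach_type {
	ATTACH_INET_INGRESS,
	ATTACH_INET_EGRESS,
	ATTACH_GETSOCKOPT,
	ATTACH_SETSOCKOPT,
	MAX_ATTACH_TYPE
};

/* One counter per attach type instead of a single global "enabled" flag. */
static int enabled_count[MAX_ATTACH_TYPE];

void attach_prog(enum attach_type type)
{
	enabled_count[type]++;
}

void detach_prog(enum attach_type type)
{
	assert(enabled_count[type] > 0);
	enabled_count[type]--;
}

/* A hook pays its cost only if something is attached to its own type. */
bool hook_enabled(enum attach_type type)
{
	return enabled_count[type] > 0;
}
```

      In this model, attaching a getsockopt program leaves
      hook_enabled(ATTACH_INET_INGRESS) false, so the skb path is not
      taxed by an unrelated attachment.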
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20210115163501.805133-4-sdf@google.com
      a9ed15da
    • bpf: Try to avoid kzalloc in cgroup/{s,g}etsockopt · 20f2505f
      Stanislav Fomichev authored
      When we attach a bpf program to cgroup/getsockopt any other getsockopt()
      syscall starts incurring kzalloc/kfree cost.
      
      Let's add a small buffer on the stack and use it for small (majority)
      {s,g}etsockopt values. The buffer is small enough to fit into
      the cache line and cover the majority of simple options (most
      of them are 4 byte ints).
      
      It seems natural to do the same for setsockopt, but it's a bit more
      involved when the BPF program modifies the data (where we have to
      kmalloc). The assumption is that for the majority of setsockopt
      calls (which are doing pure BPF options or apply policy) this
      will bring some benefit as well.
      
      Without this patch (we remove about 1% __kmalloc):
           3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
                  |
                   --3.30%--__cgroup_bpf_run_filter_getsockopt
                             |
                              --0.81%--__kmalloc
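
      The small-buffer idea described above can be sketched in user space
      as follows. This is not the kernel code: struct opt_buf and its
      helpers are hypothetical names, and calloc/free stand in for
      kzalloc/kfree. The 64-byte size follows the v2 changelog note
      ("128 -> 64 buffer size").

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_OPT_BUF 64 /* inline capacity; taken from the v2 note */

struct opt_buf {
	unsigned char inline_buf[SMALL_OPT_BUF]; /* lives with the caller's frame */
	unsigned char *data;                     /* points at inline_buf or heap */
	size_t len;
};

/* Use the inline buffer when the value fits, fall back to the heap otherwise.
 * Returns 0 on success and -1 if the heap allocation fails. */
int opt_buf_init(struct opt_buf *b, size_t len)
{
	b->len = len;
	if (len <= sizeof(b->inline_buf)) {
		memset(b->inline_buf, 0, sizeof(b->inline_buf));
		b->data = b->inline_buf;
		return 0;
	}
	b->data = calloc(1, len);
	return b->data ? 0 : -1;
}

bool opt_buf_on_heap(const struct opt_buf *b)
{
	return b->data != b->inline_buf;
}

/* Only heap-backed buffers need to be freed. */
void opt_buf_release(struct opt_buf *b)
{
	if (opt_buf_on_heap(b))
		free(b->data);
	b->data = NULL;
}
```

      A 4-byte integer option stays in the inline buffer, so the common
      case performs no allocation at all; only unusually large values
      reach the allocator.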
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210115163501.805133-3-sdf@google.com
      20f2505f
    • bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE · 9cacf81f
      Stanislav Fomichev authored
      Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
      We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
      call in do_tcp_getsockopt using the on-stack data. This removes
      3% overhead for locking/unlocking the socket.
      
      Without this patch:
           3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
                  |
                   --3.30%--__cgroup_bpf_run_filter_getsockopt
                             |
                              --0.81%--__kmalloc
      
      With the patch applied:
           0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
      
      Note, exporting uapi/tcp.h requires removing netinet/tcp.h
      from test_progs.h because those headers have conflicting
      definitions.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
      9cacf81f
    • bpf: Permit size-0 datasec · 13ca51d5
      Yonghong Song authored
      llvm patch https://reviews.llvm.org/D84002 permitted
      emitting an empty rodata datasec if the elf .rodata section
      contains read-only data from local variables. These
      local variables will not be emitted as BTF_KIND_VARs
      since llvm converted these local variables to
      static variables with private linkage without debuginfo
      types. Such an empty rodata datasec will make
      skeleton code generation easy since for skeleton
      a rodata struct will be generated if there is a
      .rodata elf section. The existence of a rodata
      btf datasec is also consistent with the existence
      of a rodata map created by libbpf.
      
      The btf with such an empty rodata datasec will fail
      in the kernel though as kernel will reject a datasec
      with zero vlen and zero size. For example, for the below code,
          int sys_enter(void *ctx)
          {
             int fmt[6] = {1, 2, 3, 4, 5, 6};
             int dst[6];
      
             bpf_probe_read(dst, sizeof(dst), fmt);
             return 0;
          }
      We got the below btf (bpftool btf dump ./test.o):
          [1] PTR '(anon)' type_id=0
          [2] FUNC_PROTO '(anon)' ret_type_id=3 vlen=1
                  'ctx' type_id=1
          [3] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
          [4] FUNC 'sys_enter' type_id=2 linkage=global
          [5] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
          [6] ARRAY '(anon)' type_id=5 index_type_id=7 nr_elems=4
          [7] INT '__ARRAY_SIZE_TYPE__' size=4 bits_offset=0 nr_bits=32 encoding=(none)
          [8] VAR '_license' type_id=6, linkage=global-alloc
          [9] DATASEC '.rodata' size=0 vlen=0
          [10] DATASEC 'license' size=0 vlen=1
                  type_id=8 offset=0 size=4
      When loading the ./test.o to the kernel with bpftool,
      we see the following error:
          libbpf: Error loading BTF: Invalid argument(22)
          libbpf: magic: 0xeb9f
          ...
          [6] ARRAY (anon) type_id=5 index_type_id=7 nr_elems=4
          [7] INT __ARRAY_SIZE_TYPE__ size=4 bits_offset=0 nr_bits=32 encoding=(none)
          [8] VAR _license type_id=6 linkage=1
          [9] DATASEC .rodata size=24 vlen=0 vlen == 0
          libbpf: Error loading .BTF into kernel: -22. BTF is optional, ignoring.
      
      Basically, libbpf changed .rodata datasec size to 24 since elf .rodata
      section size is 24. The kernel then rejected the BTF since vlen = 0.
      Note that the above kernel verifier failure can be worked around by
      changing local variable "fmt" to a static or global, optionally const, variable.
      
      This patch permits a datasec with vlen = 0 in kernel.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210119153519.3901963-1-yhs@fb.com
      13ca51d5
    • Merge branch 'Allow attaching to bare tracepoints' · 71ee10e2
      Alexei Starovoitov authored
      Qais Yousef says:
      
      ====================
      
      Changes in v3:
      	* Fix not returning error value correctly in
      	  trigger_module_test_write() (Yonghong)
      	* Add Yonghong acked-by to patch 1.
      
      Changes in v2:
      	* Fix compilation error. (Andrii)
      	* Make the new test use write() instead of read() (Andrii)
      
      Add some missing glue logic to teach bpf about bare tracepoints - tracepoints
      without any trace event associated with them.
      
      Bare tracepoints are declared with DECLARE_TRACE(). Full tracepoints are declared
      with TRACE_EVENT().
      
      BPF can attach to these tracepoints as RAW_TRACEPOINT() only as there're no
      events in tracefs created with them.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      71ee10e2
    • selftests: bpf: Add a new test for bare tracepoints · 407be922
      Qais Yousef authored
      Reuse module_attach infrastructure to add a new bare tracepoint to check
      we can attach to it as a raw tracepoint.
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210119122237.2426878-3-qais.yousef@arm.com
      407be922
    • Merge branch 'bpf,x64: implement jump padding in jit' · 86e6b4e9
      Alexei Starovoitov authored
      Gary Lin says:
      
      ====================
      This patch series implements jump padding to x64 jit to cover some
      corner cases that used to consume more than 20 jit passes and caused
      failure.
      
      v4:
        - Add the detailed comments about the possible padding bytes
        - Add the second test case which triggers jmp_cond padding and imm32 nop
          jmp padding.
        - Add the new test case as another subprog
      
      v3:
        - Copy the instructions of prologue separately or the size calculation
          of the first BPF instruction would include the prologue.
        - Replace WARN_ONCE() with pr_err() and EFAULT
        - Use MAX_PASSES in the for loop condition check
        - Remove the "padded" flag from x64_jit_data. For the extra pass of
          subprogs, padding is always enabled since it won't hurt the images
          that converge without padding.
      v2:
        - Simplify the sample code in the commit description and provide the
          jit code
        - Check the expected padding bytes with WARN_ONCE
        - Move the 'padded' flag to 'struct x64_jit_data'
        - Remove the EXPECTED_FAIL flag from bpf_fill_maxinsns11() in test_bpf
        - Add 2 verifier tests
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      86e6b4e9
    • trace: bpf: Allow bpf to attach to bare tracepoints · 6939f4ef
      Qais Yousef authored
      Some subsystems only have bare tracepoints (a tracepoint with no
      associated trace event) to avoid the problem of trace events being an
      ABI that can't be changed.
      
      From the bpf perspective, bare tracepoints are what it calls
      RAW_TRACEPOINT().
      
      Since bpf assumed there's 1:1 mapping, it relied on hooking to
      DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
      bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
      no knowledge about their existence.
      
      By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
      DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.
      
      Enabling that comes with the contract that changes to raw tracepoints
      don't constitute a regression if they break existing bpf programs.
      We need the ability to continue to morph and modify these raw
      tracepoints without worrying about any ABI.
      
      Update Documentation/bpf/bpf_design_QA.rst to document this contract.
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20210119122237.2426878-2-qais.yousef@arm.com
      6939f4ef
    • selftests/bpf: Add verifier tests for x64 jit jump padding · 79d1b684
      Gary Lin authored
      There are 3 tests added into verifier's jit tests to trigger x64
      jit jump padding.
      
      The first test can be represented as the following assembly code:
      
            1: bpf_call bpf_get_prandom_u32
            2: if r0 == 1 goto pc+128
            3: if r0 == 2 goto pc+128
               ...
          129: if r0 == 128 goto pc+128
          130: goto pc+128
          131: goto pc+127
               ...
          256: goto pc+2
          257: goto pc+1
          258: r0 = 1
          259: ret
      
      We first store a random number to r0 and add the corresponding
      conditional jumps (2~129) to make verifier believe that those jump
      instructions from 130 to 257 are reachable. When the program is sent to
      x64 jit, it starts to optimize out the NOP jumps backwards from 257.
      Since there are 128 such jumps, the program easily reaches 15 passes and
      triggers jump padding.
      
      Here is the x64 jit code of the first test:
      
            0:    0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
            5:    66 90                   xchg   ax,ax
            7:    55                      push   rbp
            8:    48 89 e5                mov    rbp,rsp
            b:    e8 4c 90 75 e3          call   0xffffffffe375905c
           10:    48 83 f8 01             cmp    rax,0x1
           14:    0f 84 fe 04 00 00       je     0x518
           1a:    48 83 f8 02             cmp    rax,0x2
           1e:    0f 84 f9 04 00 00       je     0x51d
            ...
           f6:    48 83 f8 18             cmp    rax,0x18
           fa:    0f 84 8b 04 00 00       je     0x58b
          100:    48 83 f8 19             cmp    rax,0x19
          104:    0f 84 86 04 00 00       je     0x590
          10a:    48 83 f8 1a             cmp    rax,0x1a
          10e:    0f 84 81 04 00 00       je     0x595
            ...
          500:    0f 84 83 01 00 00       je     0x689
          506:    48 81 f8 80 00 00 00    cmp    rax,0x80
          50d:    0f 84 76 01 00 00       je     0x689
          513:    e9 71 01 00 00          jmp    0x689
          518:    e9 6c 01 00 00          jmp    0x689
            ...
          5fe:    e9 86 00 00 00          jmp    0x689
          603:    e9 81 00 00 00          jmp    0x689
          608:    0f 1f 00                nop    DWORD PTR [rax]
          60b:    eb 7c                   jmp    0x689
          60d:    eb 7a                   jmp    0x689
            ...
          683:    eb 04                   jmp    0x689
          685:    eb 02                   jmp    0x689
          687:    66 90                   xchg   ax,ax
          689:    b8 01 00 00 00          mov    eax,0x1
          68e:    c9                      leave
          68f:    c3                      ret
      
      As expected, a 3-byte NOP is inserted at 608 due to the transition
      from imm32 jmp to imm8 jmp. A 2-byte NOP is also inserted at 687 to
      replace a NOP jump.
      
      The second test case is tricky. Here is the assembly code:
      
             1: bpf_call bpf_get_prandom_u32
             2: if r0 == 1 goto pc+2048
             3: if r0 == 2 goto pc+2048
             ...
          2049: if r0 == 2048 goto pc+2048
          2050: goto pc+2048
          2051: goto pc+16
          2052: goto pc+15
             ...
          2064: goto pc+3
          2065: goto pc+2
          2066: goto pc+1
             ...
             [repeat "goto pc+16".."goto pc+1" 127 times]
             ...
          4099: r0 = 2
          4100: ret
      
      There are 4 major parts of the program.
      1) 1~2049: Those are instructions to make 2050~4098 reachable. Some of
                 them also could generate the padding for jmp_cond.
      2) 2050: This is the target instruction for the imm32 nop jmp padding.
      3) 2051~4098: The repeated "goto 1~16" instructions are designed to be
                    consumed by the nop jmp optimization. In the end, those
                    instructions become 128 continuous 0 offset jmps and are
                    optimized out in 1 pass, and this makes insn 2050 an imm32
                    nop jmp in the next pass, so that we can trigger the
                    5 bytes padding.
      4) 4099~4100: Those are the instructions to end the program.
      
      The x64 jit code is like this:
      
             0:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
             5:       66 90                   xchg   ax,ax
             7:       55                      push   rbp
             8:       48 89 e5                mov    rbp,rsp
             b:       e8 bc 7b d5 d3          call   0xffffffffd3d57bcc
            10:       48 83 f8 01             cmp    rax,0x1
            14:       0f 84 7e 66 00 00       je     0x6698
            1a:       48 83 f8 02             cmp    rax,0x2
            1e:       0f 84 74 66 00 00       je     0x6698
            24:       48 83 f8 03             cmp    rax,0x3
            28:       0f 84 6a 66 00 00       je     0x6698
            2e:       48 83 f8 04             cmp    rax,0x4
            32:       0f 84 60 66 00 00       je     0x6698
            38:       48 83 f8 05             cmp    rax,0x5
            3c:       0f 84 56 66 00 00       je     0x6698
            42:       48 83 f8 06             cmp    rax,0x6
            46:       0f 84 4c 66 00 00       je     0x6698
            ...
          666c:       48 81 f8 fe 07 00 00    cmp    rax,0x7fe
          6673:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
          6677:       74 1f                   je     0x6698
          6679:       48 81 f8 ff 07 00 00    cmp    rax,0x7ff
          6680:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
          6684:       74 12                   je     0x6698
          6686:       48 81 f8 00 08 00 00    cmp    rax,0x800
          668d:       0f 1f 40 00             nop    DWORD PTR [rax+0x0]
          6691:       74 05                   je     0x6698
          6693:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
          6698:       b8 02 00 00 00          mov    eax,0x2
          669d:       c9                      leave
          669e:       c3                      ret
      
      Since insn 2051~4098 are optimized out right before the padding pass,
      several conditional jumps from the first part are replaced with
      imm8 jmp_cond, and this triggers the 4 bytes padding, for example at
      6673, 6680, and 668d. On the other hand, insn 2050 is replaced with the
      5 bytes nops at 6693.
      
      The third test invokes the first and second tests as subprogs to test
      bpf2bpf. Per the system log, one more jit run happened with only
      one pass, and the same jit code was produced.
      
      v4:
        - Add the second test case which triggers jmp_cond padding and imm32 nop
          jmp padding.
        - Add the new test case as another subprog
      Signed-off-by: Gary Lin <glin@suse.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210119102501.511-4-glin@suse.com
      79d1b684
    • test_bpf: Remove EXPECTED_FAIL flag from bpf_fill_maxinsns11 · 16a660ef
      Gary Lin authored
      With NOPs padding, x64 jit now can handle the jump cases like
      bpf_fill_maxinsns11().
      Signed-off-by: Gary Lin <glin@suse.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210119102501.511-3-glin@suse.com
      16a660ef
    • bpf,x64: Pad NOPs to make images converge more easily · 93c5aecc
      Gary Lin authored
      The x64 bpf jit expects bpf images to converge within the given passes, but
      it could fail to do so in some corner cases. For example:
      
            l0:     ja 40
            l1:     ja 40
      
              [... repeated ja 40 ]
      
            l39:    ja 40
            l40:    ret #0
      
      This bpf program contains 40 "ja 40" instructions which are effectively
      NOPs and designed to be replaced with valid code dynamically. Ideally,
      bpf jit should optimize those "ja 40" instructions out when translating
      the bpf instructions into x64 machine code. However, do_jit() can only
      remove one "ja 40" for offset==0 on each pass, so it requires at least
      40 runs to eliminate those JMPs and exceeds the current limit of
      passes(20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
      is set even though it's legit as a classic socket filter.
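
      The one-jump-per-pass behaviour can be reproduced with a simplified
      model. This is an illustrative sketch and not do_jit(): it tracks
      only instruction sizes, assumes a 64-byte initial estimate per
      instruction, and ignores the prologue and the final image pass. All
      names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define EST_INSN_SIZE 64 /* assumed initial size estimate per instruction */

/* Encoded size of a jump for a given byte offset in this model. */
static int jmp_size(int off)
{
	if (off == 0)
		return 0; /* jump to the next instruction: dropped */
	if (off >= -128 && off <= 127)
		return 2; /* short jump, imm8 */
	return 5;         /* near jump, imm32 */
}

/*
 * Model n jumps that all target the instruction right after the last one.
 * end[i] holds the end offset of jump i as computed by the previous pass.
 * Returns the pass on which the total length stops changing, or -1 if that
 * does not happen within max_passes.
 */
int passes_to_converge(int n, int max_passes)
{
	int *end, oldlen, pass, i;

	assert(n > 0);
	end = malloc(n * sizeof(*end));
	if (!end)
		return -1;
	for (i = 0; i < n; i++)
		end[i] = (i + 1) * EST_INSN_SIZE;
	oldlen = n * EST_INSN_SIZE;

	for (pass = 1; pass <= max_passes; pass++) {
		int target = end[n - 1]; /* target address from the previous pass */
		int len = 0;

		for (i = 0; i < n; i++) {
			/* end[i] still holds the previous pass value here */
			len += jmp_size(target - end[i]);
			end[i] = len;
		}
		if (len == oldlen) {
			free(end);
			return pass;
		}
		oldlen = len;
	}
	free(end);
	return -1;
}
```

      Under this model, passes_to_converge(40, 20) returns -1, while
      raising the limit to 64 lets the image converge on pass 41: one
      pass per removed jump, plus one pass to observe that the length
      stopped changing.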
      
      To make bpf images more likely to converge within 20 passes, this commit
      pads some instructions with NOPs in the last 5 passes:
      
      1. conditional jumps
        A possible size variance comes from the adoption of imm8 JMP. If the
        offset is imm8, we calculate the size difference of this BPF instruction
        between the previous and the current pass and fill the gap with NOPs.
        To avoid the recalculation of jump offset, those NOPs are inserted before
        the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
        calculating the NOP number.
      
      2. BPF_JA
        There are two conditions for BPF_JA.
        a.) nop jumps
          If this instruction is not optimized out in the previous pass,
          instead of removing it, we insert the equivalent size of NOPs.
        b.) label jumps
          Similar to conditional jumps, we prepend NOPs right before the JMP
          code.
      
      To make the code concise, emit_nops() is modified to use the signed len and
      return the number of inserted NOPs.
      
      For bpf-to-bpf, we always enable padding for the extra pass since there
      is only one extra run and the jump padding doesn't affect the images
      that converge without padding.
      
      After applying this patch, the corner case was loaded with the following
      jit code:
      
          flen=45 proglen=77 pass=17 image=ffffffffc03367d4 from=jump pid=10097
          JIT code: 00000000: 0f 1f 44 00 00 55 48 89 e5 53 41 55 31 c0 45 31
          JIT code: 00000010: ed 48 89 fb eb 30 eb 2e eb 2c eb 2a eb 28 eb 26
          JIT code: 00000020: eb 24 eb 22 eb 20 eb 1e eb 1c eb 1a eb 18 eb 16
          JIT code: 00000030: eb 14 eb 12 eb 10 eb 0e eb 0c eb 0a eb 08 eb 06
          JIT code: 00000040: eb 04 eb 02 66 90 31 c0 41 5d 5b c9 c3
      
           0: 0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
           5: 55                      push   rbp
           6: 48 89 e5                mov    rbp,rsp
           9: 53                      push   rbx
           a: 41 55                   push   r13
           c: 31 c0                   xor    eax,eax
           e: 45 31 ed                xor    r13d,r13d
          11: 48 89 fb                mov    rbx,rdi
          14: eb 30                   jmp    0x46
          16: eb 2e                   jmp    0x46
              ...
          3e: eb 06                   jmp    0x46
          40: eb 04                   jmp    0x46
          42: eb 02                   jmp    0x46
          44: 66 90                   xchg   ax,ax
          46: 31 c0                   xor    eax,eax
          48: 41 5d                   pop    r13
          4a: 5b                      pop    rbx
          4b: c9                      leave
          4c: c3                      ret
      
      At the 16th pass, 15 jumps were already optimized out, and one jump was
      replaced with NOPs at 44 and the image converged at the 17th pass.
      
      v4:
        - Add the detailed comments about the possible padding bytes
      
      v3:
        - Copy the instructions of prologue separately or the size calculation
          of the first BPF instruction would include the prologue.
        - Replace WARN_ONCE() with pr_err() and EFAULT
        - Use MAX_PASSES in the for loop condition check
        - Remove the "padded" flag from x64_jit_data. For the extra pass of
          subprogs, padding is always enabled since it won't hurt the images
          that converge without padding.
      
      v2:
        - Simplify the sample code in the description and provide the jit code
        - Check the expected padding bytes with WARN_ONCE
        - Move the 'padded' flag to 'struct x64_jit_data'
      Signed-off-by: Gary Lin <glin@suse.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210119102501.511-2-glin@suse.com
      93c5aecc
    • docs, bpf: Add minimal markup to address doc warning · d2e04b9d
      Lukas Bulwahn authored
      Commit 91c960b0 ("bpf: Rename BPF_XADD and prepare to encode other
      atomics in .imm") modified the BPF documentation, but missed some ReST
      markup.
      
      Hence, make htmldocs warns on Documentation/networking/filter.rst:1053:
      
        WARNING: Inline emphasis start-string without end-string.
      
      Add some minimal markup to address this warning.
      
      Fixes: 91c960b0 ("bpf: Rename BPF_XADD and prepare to encode other atomics in .imm")
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Brendan Jackman <jackmanb@google.com>
      Link: https://lore.kernel.org/bpf/20210118080004.6367-1-lukas.bulwahn@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d2e04b9d
    • samples/bpf: Add BPF_ATOMIC_OP macro for BPF samples · da9d35e2
      Björn Töpel authored
      Brendan Jackman added extended atomic operations to the BPF instruction
      set in commit 7064a734 ("Merge branch 'Atomics for eBPF'"), which
      introduces the BPF_ATOMIC_OP macro. However, that macro was missing
      for the BPF samples. Fix that by adding it into bpf_insn.h.
      
      Fixes: 91c960b0 ("bpf: Rename BPF_XADD and prepare to encode other atomics in .imm")
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Link: https://lore.kernel.org/bpf/20210118091753.107572-1-bjorn.topel@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      da9d35e2
    • net, xdp: Introduce xdp_build_skb_from_frame utility routine · 89f479f0
      Lorenzo Bianconi authored
      Introduce xdp_build_skb_from_frame utility routine to build the skb
      from xdp_frame. Unlike __xdp_build_skb_from_frame,
      xdp_build_skb_from_frame will allocate the skb object. Rely on
      xdp_build_skb_from_frame in veth driver.
      Add missing xdp metadata support in veth_xdp_rcv_one().
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      89f479f0
    • net, xdp: Introduce __xdp_build_skb_from_frame utility routine · 97a0e1ea
      Lorenzo Bianconi authored
      Introduce __xdp_build_skb_from_frame utility routine to build
      the skb from xdp_frame. Rely on __xdp_build_skb_from_frame in
      cpumap code.
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/4f9f4c6b3dd3933770c617eb6689dbc0c6e25863.1610475660.git.lorenzo@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      97a0e1ea
    • bpf, selftests: Fold test_current_pid_tgid_new_ns into test_progs. · 09c02d55
      Carlos Neira authored
      Currently tests for bpf_get_ns_current_pid_tgid() are outside test_progs.
      This change folds test cases into test_progs.
      
      Changes from v11:
      
       - Fixed test failure not being detected.
       - Removed EXIT(3) call as it will stop test_progs execution.
      Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210114141033.GA17348@localhost
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      09c02d55
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 0fe2f273
      Jakub Kicinski authored
      Conflicts:
      
      drivers/net/can/dev.c
        commit 03f16c50 ("can: dev: can_restart: fix use after free bug")
        commit 3e77f70e ("can: dev: move driver related infrastructure into separate subdir")
      
        Code move.
      
      drivers/net/dsa/b53/b53_common.c
       commit 8e4052c3 ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
       commit b7a9e0da ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")
      
       Field rename.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0fe2f273
    • Merge tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 75439bc4
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Networking fixes for 5.11-rc5, including fixes from bpf, wireless, and
        can trees.
      
        Current release - regressions:
      
         - nfc: nci: fix the wrong NCI_CORE_INIT parameters
      
        Current release - new code bugs:
      
         - bpf: allow empty module BTFs
      
        Previous releases - regressions:
      
         - bpf: fix signed_{sub,add32}_overflows type handling
      
         - tcp: do not mess with cloned skbs in tcp_add_backlog()
      
         - bpf: prevent double bpf_prog_put call from bpf_tracing_prog_attach
      
         - bpf: don't leak memory in bpf getsockopt when optlen == 0
      
         - tcp: fix potential use-after-free due to double kfree()
      
         - mac80211: fix encryption issues with WEP
      
         - devlink: use right genl user_ptr when handling port param get/set
      
         - ipv6: set multicast flag on the multicast route
      
         - tcp: fix TCP_USER_TIMEOUT with zero window
      
        Previous releases - always broken:
      
         - bpf: local storage helpers should check nullness of owner ptr passed
      
         - mac80211: fix incorrect strlen of .write in debugfs
      
         - cls_flower: call nla_ok() before nla_next()
      
         - skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too"
      
      * tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
        net: systemport: free dev before on error path
        net: usb: cdc_ncm: don't spew notifications
        net: mscc: ocelot: Fix multicast to the CPU port
        tcp: Fix potential use-after-free due to double kfree()
        bpf: Fix signed_{sub,add32}_overflows type handling
        can: peak_usb: fix use after free bugs
        can: vxcan: vxcan_xmit: fix use after free bug
        can: dev: can_restart: fix use after free bug
        tcp: fix TCP socket rehash stats mis-accounting
        net: dsa: b53: fix an off by one in checking "vlan->vid"
        tcp: do not mess with cloned skbs in tcp_add_backlog()
        selftests: net: fib_tests: remove duplicate log test
        net: nfc: nci: fix the wrong NCI_CORE_INIT parameters
        sh_eth: Fix power down vs. is_opened flag ordering
        net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
        netfilter: rpfilter: mask ecn bits before fib lookup
        udp: mask TOS bits in udp_v4_early_demux()
        xsk: Clear pool even for inactive queues
        bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
        sh_eth: Make PHY access aware of Runtime PM to fix reboot crash
        ...
      75439bc4
    • Linus Torvalds's avatar
      Merge tag 'for-linus-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 2e4ceed6
      Linus Torvalds authored
      Pull xen fix from Juergen Gross:
       "A fix for build failure showing up in some configurations"
      
      * tag 'for-linus-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        x86/xen: fix 'nopvspin' build error
      2e4ceed6
    • Tianjia Zhang's avatar
      X.509: Fix crash caused by NULL pointer · 7178a107
      Tianjia Zhang authored
      On the following call path, `sig->pkey_algo` is not assigned
      in asymmetric_key_verify_signature(), which causes runtime
      crash in public_key_verify_signature().
      
        keyctl_pkey_verify
          asymmetric_key_verify_signature
            verify_signature
              public_key_verify_signature
      
      This patch simply checks for this situation and fixes the crash
      caused by the NULL pointer.
      
      Fixes: 21552563 ("X.509: support OSCCA SM2-with-SM3 certificate verification")
      Reported-by: Tobias Markus <tobias@markus-regensburg.de>
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Tested-by: João Fonseca <jpedrofonseca@ua.pt>
      Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
      Cc: stable@vger.kernel.org # v5.10+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7178a107
    • Takashi Iwai's avatar
      cachefiles: Drop superfluous readpages aops NULL check · db58465f
      Takashi Iwai authored
      After the recent work to convert readpages aops to readahead, the
      NULL checks of readpages aops in cachefiles_read_or_alloc_page() may
      trigger falsely.  Worse, the check is an ASSERT() call, so it panics.
      
      Drop the superfluous NULL checks to fix this regression.
      
      [DH: Note that cachefiles never actually used readpages, so this check was
       never actually necessary]
      
      BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208883
      BugLink: https://bugzilla.opensuse.org/show_bug.cgi?id=1175245
      Fixes: 9ae326a6 ("CacheFiles: A cache that backs onto a mounted filesystem")
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db58465f
    • Jakub Kicinski's avatar
      Merge tag 'linux-can-fixes-for-5.11-20210120' of... · 535d3159
      Jakub Kicinski authored
      Merge tag 'linux-can-fixes-for-5.11-20210120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
      
      Marc Kleine-Budde says:
      
      ====================
      linux-can-fixes-for-5.11-20210120
      
      All three patches are by Vincent Mailhol and fix a potential use-after-free
      bug in the CAN device infrastructure, the vxcan driver, and the peak_usb
      driver. In the TX path, the skb is read after it has been passed to the
      networking stack with netif_rx_ni().
      
      * tag 'linux-can-fixes-for-5.11-20210120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
        can: peak_usb: fix use after free bugs
        can: vxcan: vxcan_xmit: fix use after free bug
        can: dev: can_restart: fix use after free bug
      ====================
      
      Link: https://lore.kernel.org/r/20210120125202.2187358-1-mkl@pengutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      535d3159
    • Pan Bian's avatar
      net: systemport: free dev before on error path · 0c630a66
      Pan Bian authored
      On the error path, the code should jump to the error handling label to
      free the allocated memory rather than return directly.
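
      The goto-based cleanup pattern described here can be sketched in plain
      userspace C. This is only an illustration under stated assumptions:
      fake_dev, fake_dev_alloc(), fake_dev_free() and fake_probe() are
      hypothetical names, not the systemport driver's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device object; not the driver's structure. */
struct fake_dev {
	int irq;
};

/* Counts outstanding allocations so a leak is observable in this sketch. */
static int live_allocs;

static struct fake_dev *fake_dev_alloc(void)
{
	struct fake_dev *dev = calloc(1, sizeof(*dev));

	if (dev)
		live_allocs++;
	return dev;
}

static void fake_dev_free(struct fake_dev *dev)
{
	if (!dev)
		return;
	assert(live_allocs > 0);
	free(dev);
	live_allocs--;
}

/*
 * Hypothetical probe routine. When a later setup step fails (simulated by
 * clk_fails), it jumps to the cleanup label instead of returning directly,
 * so the earlier allocation is released.
 */
static int fake_probe(int clk_fails, struct fake_dev **out)
{
	struct fake_dev *dev;
	int ret;

	dev = fake_dev_alloc();
	if (!dev)
		return -1;

	if (clk_fails) {
		ret = -2;
		goto err_free_dev; /* a bare "return ret;" here would leak dev */
	}

	*out = dev;
	return 0;

err_free_dev:
	fake_dev_free(dev);
	return ret;
}
```

      On success the caller owns the returned object and releases it with
      fake_dev_free(); on failure nothing is left allocated.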
      
      Fixes: 31bc72d9 ("net: systemport: fetch and use clock resources")
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20210120044423.1704-1-bianpan2016@163.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0c630a66
    • Grant Grundler's avatar
      net: usb: cdc_ncm: don't spew notifications · de658a19
      Grant Grundler authored
      RTL8156 sends notifications about every 32ms.
      Only display/log notifications when something changes.
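
      A minimal sketch of the "log only on change" idea, assuming the last
      reported state is cached per device. The names link_state,
      on_connection_notify() and on_speed_notify() are hypothetical and do
      not come from the cdc_ncm driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical cache of the last state reported by the device. */
struct link_state {
	bool connected;
	unsigned int rx_mbit;
	unsigned int tx_mbit;
};

/* Number of lines actually emitted, to make the effect measurable. */
static unsigned int lines_logged;

/* Log a connection notification only if the state differs from the cache. */
static void on_connection_notify(struct link_state *st, bool connected)
{
	assert(st != NULL);
	if (st->connected == connected)
		return; /* repeated notification, nothing new to report */

	st->connected = connected;
	printf("network connection: %sconnected\n", connected ? "" : "dis");
	lines_logged++;
}

/* Log a speed notification only if either direction changed. */
static void on_speed_notify(struct link_state *st,
			    unsigned int rx_mbit, unsigned int tx_mbit)
{
	assert(st != NULL);
	if (st->rx_mbit == rx_mbit && st->tx_mbit == tx_mbit)
		return;

	st->rx_mbit = rx_mbit;
	st->tx_mbit = tx_mbit;
	printf("%u mbit/s downlink %u mbit/s uplink\n", rx_mbit, tx_mbit);
	lines_logged++;
}
```

      With this shape, a device that repeats the same notification every
      32ms produces one line per state change instead of one line per
      notification.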
      
      This issue has been reported by others:
      	https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1832472
      	https://lkml.org/lkml/2020/8/27/1083
      
      ...
      [785962.779840] usb 1-1: new high-speed USB device number 5 using xhci_hcd
      [785962.929944] usb 1-1: New USB device found, idVendor=0bda, idProduct=8156, bcdDevice=30.00
      [785962.929949] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
      [785962.929952] usb 1-1: Product: USB 10/100/1G/2.5G LAN
      [785962.929954] usb 1-1: Manufacturer: Realtek
      [785962.929956] usb 1-1: SerialNumber: 000000001
      [785962.991755] usbcore: registered new interface driver cdc_ether
      [785963.017068] cdc_ncm 1-1:2.0: MAC-Address: 00:24:27:88:08:15
      [785963.017072] cdc_ncm 1-1:2.0: setting rx_max = 16384
      [785963.017169] cdc_ncm 1-1:2.0: setting tx_max = 16384
      [785963.017682] cdc_ncm 1-1:2.0 usb0: register 'cdc_ncm' at usb-0000:00:14.0-1, CDC NCM, 00:24:27:88:08:15
      [785963.019211] usbcore: registered new interface driver cdc_ncm
      [785963.023856] usbcore: registered new interface driver cdc_wdm
      [785963.025461] usbcore: registered new interface driver cdc_mbim
      [785963.038824] cdc_ncm 1-1:2.0 enx002427880815: renamed from usb0
      [785963.089586] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
      [785963.121673] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
      [785963.153682] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
      ...
      
      This is about 2KB per second and will overwrite all contents of a 1MB
      dmesg buffer in under 10 minutes rendering them useless for debugging
      many kernel problems.
      
      This is also an extra 180 MB/day in /var/logs (or 1GB per week) rendering
      the majority of those logs useless too.
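
      The figures above can be checked with a little arithmetic. The sketch
      below assumes binary units (2KB = 2048 bytes, 1MB = 1048576 bytes),
      which the message does not state; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a buffer of buf_bytes is overwritten at rate bytes/second. */
static uint64_t seconds_to_fill(uint64_t buf_bytes, uint64_t rate)
{
	assert(rate > 0);
	return buf_bytes / rate;
}

/* Bytes written in one day (86400 seconds) at rate bytes/second. */
static uint64_t bytes_per_day(uint64_t rate)
{
	return rate * UINT64_C(86400);
}
```

      Under these assumptions, seconds_to_fill(1048576, 2048) is 512
      seconds, which is under 10 minutes, and bytes_per_day(2048) is
      176947200 bytes, in the same range as the roughly 180 MB/day and
      1GB per week quoted above.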
      
      When the link is up (expected state), spew amount is >2x higher:
      ...
      [786139.600992] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
      [786139.632997] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
      [786139.665097] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
      [786139.697100] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
      [786139.729094] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
      [786139.761108] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
      ...
      
      Chrome OS cannot support RTL8156 until this is fixed.
      Signed-off-by: Grant Grundler <grundler@chromium.org>
      Reviewed-by: Hayes Wang <hayeswang@realtek.com>
      Link: https://lore.kernel.org/r/20210120011208.3768105-1-grundler@chromium.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      de658a19
    • Alban Bedel's avatar
      net: mscc: ocelot: Fix multicast to the CPU port · 584b7cfc
      Alban Bedel authored
      Multicast entries in the MAC table use the high bits of the MAC
      address to encode the ports that should get the packets. But this port
      mask does not work for the CPU port. To receive these packets on the
      CPU port, the MAC_CPU_COPY flag must be set.
      
      Because of this, IPv6 was effectively not working, as neighbor
      solicitations were never received. This was not apparent before commit
      9403c158 (net: mscc: ocelot: support IPv4, IPv6 and plain Ethernet mdb
      entries) as the IPv6 entries were broken, so all incoming IPv6
      multicast was then treated as unknown and flooded on all ports.
      
      To fix this problem, rework ocelot_mact_learn() to set the
      MAC_CPU_COPY flag when a multicast entry that targets the CPU port is
      added. For this we have to read back the ports encoded in the pseudo
      MAC address by the caller. It is not a very nice design, but it avoids
      changing the callers and should make backporting easier.
      Signed-off-by: Alban Bedel <alban.bedel@aerq.com>
      Fixes: 9403c158 ("net: mscc: ocelot: support IPv4, IPv6 and plain Ethernet mdb entries")
      Link: https://lore.kernel.org/r/20210119140638.203374-1-alban.bedel@aerq.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      584b7cfc
    • Kuniyuki Iwashima's avatar
      tcp: Fix potential use-after-free due to double kfree() · c89dffc7
      Kuniyuki Iwashima authored
      On receiving an ACK with a valid SYN cookie, cookie_v4_check() allocates
      struct request_sock and may then allocate inet_rsk(req)->ireq_opt. After
      that, tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
      inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
      socket into ehash and sets ireq_opt to NULL. Otherwise,
      tcp_v4_syn_recv_sock() has to reset inet_opt to NULL and free the full
      socket.
      
      The commit 01770a16 ("tcp: fix race condition when creating child
      sockets from syncookies") added a new path in which more than one core
      can create a full socket for the same SYN cookie. Currently, the core that
      loses the race frees the full socket without resetting inet_opt, so
      both sock_put() and reqsk_put() call kfree() on the same memory:
      
        sock_put
          sk_free
            __sk_free
              sk_destruct
                __sk_destruct
                  sk->sk_destruct/inet_sock_destruct
                    kfree(rcu_dereference_protected(inet->inet_opt, 1));
      
        reqsk_put
          reqsk_free
            __reqsk_free
              req->rsk_ops->destructor/tcp_v4_reqsk_destructor
                kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));
      
      Calling kmalloc() between the two kfree() calls can lead to a
      use-after-free, so this patch fixes it by setting inet_opt to NULL
      before sock_put().
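
      The aliasing problem can be sketched in userspace C: two objects hold
      the same pointer, and each destructor frees it unless the alias is
      cleared first. All names below (fake_req, fake_sock, fake_req_put(),
      fake_sock_put(), fake_drop_losing_sock()) are hypothetical and are
      not the kernel's functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts non-NULL frees so a double free would be visible as a count of 2. */
static int free_calls;

static void counted_free(void *p)
{
	if (p)
		free_calls++;
	free(p);
}

/* Hypothetical request object that owns the options buffer. */
struct fake_req {
	void *opt;
};

/* Hypothetical full socket that received a copy of the same pointer. */
struct fake_sock {
	void *opt;
};

/* Destructor of the request: frees whatever it still points to. */
static void fake_req_put(struct fake_req *req)
{
	counted_free(req->opt);
	req->opt = NULL;
}

/* Destructor of the socket: also frees whatever it still points to. */
static void fake_sock_put(struct fake_sock *sk)
{
	counted_free(sk->opt);
	sk->opt = NULL;
}

/*
 * Path taken by the side that lost the race: clear the aliased pointer
 * before running the socket destructor, so only the request frees it.
 */
static void fake_drop_losing_sock(struct fake_sock *sk)
{
	assert(sk != NULL);
	sk->opt = NULL; /* the request still owns the buffer */
	fake_sock_put(sk);
}
```

      Without the assignment in fake_drop_losing_sock(), both destructors
      would pass the same address to free(), which is the double free
      described above.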
      
      As a side note, this kind of issue does not happen for IPv6. This is
      because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
      correspond to ireq_opt in IPv4.
      
      Fixes: 01770a16 ("tcp: fix race condition when creating child sockets from syncookies")
      CC: Ricardo Dias <rdias@singlestore.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c89dffc7
    • Jakub Kicinski's avatar
      Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · b3741b43
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-01-20
      
      1) Fix wrong bpf_map_peek_elem_proto helper callback, from Mircea Cirjaliu.
      
      2) Fix signed_{sub,add32}_overflows type truncation, from Daniel Borkmann.
      
      3) Fix AF_XDP to also clear pools for inactive queues, from Maxim Mikityanskiy.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        bpf: Fix signed_{sub,add32}_overflows type handling
        xsk: Clear pool even for inactive queues
        bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
      ====================
      
      Link: https://lore.kernel.org/r/20210120163439.8160-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b3741b43