1. 30 Aug, 2024 13 commits
    • bpf: Remove custom build rule · 1dd7622e
      Alexey Gladkov authored
      According to the documentation, when building a kernel with the C=2
      parameter, all source files should be checked. But this does not happen
      for the kernel/bpf/ directory.
      
      $ touch kernel/bpf/core.o
      $ make C=2 CHECK=true kernel/bpf/core.o
      
      Outputs:
      
        CHECK   scripts/mod/empty.c
        CALL    scripts/checksyscalls.sh
        DESCEND objtool
        INSTALL libsubcmd_headers
        CC      kernel/bpf/core.o
      
      As can be seen, the compilation is done, but CHECK is not executed. This
      happens because kernel/bpf/Makefile defines its own compilation rule
      and omits the macro that performs the check.
      
      There is no need to duplicate the build code, and this rule can be
      removed to use generic rules.
      Acked-by: Masahiro Yamada <masahiroy@kernel.org>
      Tested-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lore.kernel.org/r/20240830074350.211308-1-legion@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add tests for iter next method returning valid pointer · 7c5f7b16
      Juntong Deng authored
      This patch adds test cases for the iter next method returning a valid
      pointer, which can also be used as usage examples.

      Currently, the iter next method should return a valid pointer.

      iter_next_trusted is the correct usage and tests that the iter next
      method returns a valid pointer. bpf_iter_task_vma_next has the
      KF_RET_NULL flag, so the returned pointer may be NULL. We need to check
      whether the pointer is NULL before using it.

      iter_next_trusted_or_null is the incorrect usage. There is no check
      before using the pointer, so it will be rejected by the verifier.
      
      iter_next_rcu and iter_next_rcu_or_null are similar test cases for
      KF_RCU_PROTECTED iterators.
      
      iter_next_rcu_not_trusted is used to test that the pointer returned by
      the iter next method of a KF_RCU_PROTECTED iterator cannot be passed to
      KF_TRUSTED_ARGS kfuncs.
      
      iter_next_ptr_mem_not_trusted is used to test that base type
      PTR_TO_MEM should not be combined with type flag PTR_TRUSTED.
      Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
      Link: https://lore.kernel.org/r/AM6PR03MB5848709758F6922F02AF9F1F99962@AM6PR03MB5848.eurprd03.prod.outlook.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Make the pointer returned by iter next method valid · 4cc8c50c
      Juntong Deng authored
      Currently we cannot pass the pointer returned by iter next method as
      argument to KF_TRUSTED_ARGS or KF_RCU kfuncs, because the pointer
      returned by iter next method is not "valid".
      
      This patch sets the pointer returned by iter next method to be valid.
      
      This is based on the fact that if the iterator is implemented correctly,
      then the pointer returned from the iter next method should be valid.
      
      This does not make a NULL pointer valid. If the iter next method has
      the KF_RET_NULL flag, the verifier requires the eBPF program to
      check for a NULL pointer.

      A KF_RCU_PROTECTED iterator is a special case: the pointer returned by
      its iter next method is only valid within an RCU critical section,
      so it should be marked MEM_RCU, not PTR_TRUSTED.

      Another special case is bpf_iter_num_next, which returns a pointer with
      base type PTR_TO_MEM. PTR_TO_MEM should not be combined with type flag
      PTR_TRUSTED (PTR_TO_MEM already means the pointer is valid).

      The pointer returned by the iter next method of other types of
      iterators is marked PTR_TRUSTED.
      
      In addition, this patch adds get_iter_from_state to help us get the
      current iterator from the current state.
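
The flag selection described above can be sketched in plain C. This is a simplified userspace model, not the verifier code: the enum names and values, the function name `iter_next_ret_flag`, and the precedence of the RCU case over the PTR_TO_MEM case are assumptions made for illustration.

```c
#include <stdbool.h>

/* Illustrative stand-ins for verifier concepts. The numeric values are
 * arbitrary and do not match the kernel's definitions. */
enum toy_base_type { TOY_PTR_TO_BTF_ID, TOY_PTR_TO_MEM };

enum toy_type_flag {
    TOY_FLAG_NONE   = 0,
    TOY_PTR_TRUSTED = 1 << 0,
    TOY_MEM_RCU     = 1 << 1,
};

/* Pick the extra type flag for the pointer returned by an iter next method. */
static enum toy_type_flag iter_next_ret_flag(bool iter_is_rcu_protected,
                                             enum toy_base_type base)
{
    if (iter_is_rcu_protected)
        return TOY_MEM_RCU;     /* valid only inside the RCU critical section */
    if (base == TOY_PTR_TO_MEM)
        return TOY_FLAG_NONE;   /* PTR_TO_MEM already implies a valid pointer */
    return TOY_PTR_TRUSTED;     /* all other iterators */
}
```

In this model a NULL return is a separate concern: a KF_RET_NULL next method would still need a NULL check in the program regardless of which flag is chosen.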
      Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
      Link: https://lore.kernel.org/r/AM6PR03MB584869F8B448EA1C87B7CDA399962@AM6PR03MB5848.eurprd03.prod.outlook.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf-add-gen_epilogue-to-bpf_verifier_ops' · f6284563
      Alexei Starovoitov authored
      Martin KaFai Lau says:
      
      ====================
      bpf: Add gen_epilogue to bpf_verifier_ops
      
      From: Martin KaFai Lau <martin.lau@kernel.org>
      
      This set allows a subsystem to patch code before BPF_EXIT.
      The verifier op, .gen_epilogue, is added for this purpose.
      One use case is the bpf qdisc: the bpf qdisc subsystem can
      ensure that skb->dev holds the correct value. It can either
      fix skb->dev inline in the epilogue or call another kernel
      function from the epilogue to handle it (e.g. drop the skb).
      Another use case could be in bpf_tcp_ca.c, to enforce that
      snd_cwnd has a valid value (e.g. a positive value).
      
      v5:
       * Removed the skip_cnt argument from adjust_jmp_off() in patch 2.
         Instead, reuse the delta argument and skip
         the [tgt_idx, tgt_idx + delta) instructions.
       * Added a BPF_JMP32_A macro in patch 3.
       * Removed pro_epilogue_subprog.c in patch 6.
         The pro_epilogue_kfunc.c has covered the subprog case.
         Renamed the file pro_epilogue_kfunc.c to pro_epilogue.c.
         Some of the SEC names and function names are changed
         accordingly (mainly shorten them by removing the _kfunc suffix).
       * Added comments to explain the tail_call result in patch 7.
       * Fixed the following bpf CI breakages. I ran it in CI
         manually to confirm:
         https://github.com/kernel-patches/bpf/actions/runs/10590714532
       * s390 zext added "w3 = w3". Adjusted the test to
         use all ALU64 and BPF_DW to avoid zext.
         Also changed the "int a" in the "struct st_ops_args" to "u64 a".
       * llvm17 does not take:
             *(u64 *)(r1 +0) = 0;
         so it is changed to:
             r3 = 0;
             *(u64 *)(r1 +0) = r3;
      
      v4:
       * Fixed a bug in the memcpy in patch 3
         The size in the memcpy should be
         epilogue_cnt * sizeof(*epilogue_buf)
      
      v3:
       * Moved epilogue_buf[16] to env.
         Patch 1 is added to move the existing insn_buf[16] to env.
       * Fixed a case that the bpf prog has a BPF_JMP that goes back
         to the first instruction of the main prog.
         The jump back to 1st insn case also applies to the prologue.
         Patch 2 is added to handle it.
       * If the bpf main prog has multiple BPF_EXIT, use a BPF_JA
         to goto the earlier patched epilogue.
         Note that there are (BPF_JMP32 | BPF_JA) vs (BPF_JMP | BPF_JA)
         details in the patch 3 commit message.
       * There are subtle changes in patch 3, so I reset the Reviewed-by.
       * Added patch 8 and patch 9 to cover the changes in patch 2 and patch 3.
       * Dropped the kfunc call from pro/epilogue and its selftests.
      
      v2:
       * Remove the RFC tag. Keep the ordering at where .gen_epilogue is
         called in the verifier relative to the check_max_stack_depth().
         This will be consistent with the other extra stack_depth
         usage like optimize_bpf_loop().
       * Use __xlated check provided by the test_loader to
         check the patched instructions after gen_pro/epilogue (Eduard).
       * Added Patch 3 by Eduard (Thanks!).
      ====================
      
      Link: https://lore.kernel.org/r/20240829210833.388152-1-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test epilogue patching when the main prog has multiple BPF_EXIT · cada0bdc
      Martin KaFai Lau authored
      This patch tests the epilogue patching when the main prog has
      multiple BPF_EXIT. The verifier should have patched the 2nd (and
      later) BPF_EXIT with a BPF_JA that goes back to the earlier
      patched epilogue instructions.
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-10-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: A pro/epilogue test when the main prog jumps back to the 1st insn · 42fdbbde
      Martin KaFai Lau authored
      This patch adds a pro/epilogue test when the main prog has a goto insn
      that goes back to the very first instruction of the prog. It is
      to test the correctness of the adjust_jmp_off(prog, 0, delta)
      after the verifier has applied the prologue and/or epilogue patch.
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-9-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add tailcall epilogue test · b191b0fd
      Martin KaFai Lau authored
      This patch adds a gen_epilogue test to test a main prog
      using a bpf_tail_call.
      
      A non test_loader test is used. The tailcall target program,
      "test_epilogue_subprog", needs to be used in a struct_ops map
      before it can be loaded. Another struct_ops map is also needed
      to host the actual "test_epilogue_tailcall" struct_ops program
      that does the bpf_tail_call. The earlier test_loader patch
      will attach all struct_ops maps but the bpf_testmod.c does
      not support >1 attached struct_ops.
      
      The earlier patch used the test_loader which has already covered
      checking for the patched pro/epilogue instructions. This is done
      by the __xlated tag.
      
      This patch goes for the regular skel load and syscall test to do
      the tailcall test, which also allows passing the
      "struct st_ops_args *args" directly as ctx_in to the
      SEC("syscall") program.
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-8-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test gen_prologue and gen_epilogue · 47e69431
      Martin KaFai Lau authored
      This test adds a new struct_ops "bpf_testmod_st_ops" in bpf_testmod.
      The ops of the bpf_testmod_st_ops is triggered by new kfunc calls
      "bpf_kfunc_st_ops_test_*logue". These new kfunc calls are
      primarily used by the SEC("syscall") program. The test triggering
      sequence is like:
          SEC("syscall")
          syscall_prologue(struct st_ops_args *args)
              bpf_kfunc_st_ops_test_prologue(args)
                  st_ops->test_prologue(args)
      
      .gen_prologue adds 1000 to args->a
      .gen_epilogue adds 10000 to args->a
      .gen_epilogue will also set the r0 to 2 * args->a.
      
      The .gen_prologue and .gen_epilogue of the bpf_testmod_st_ops
      will test the prog->aux->attach_func_name to decide whether
      they need to generate code.
      
      The main programs of the pro_epilogue.c will call a
      new kfunc bpf_kfunc_st_ops_inc10 which does "args->a += 10".
      It will also call a subprog() which does "args->a += 1".
      
      This patch uses the test_loader infra to check the __xlated
      instructions patched after gen_prologue and/or gen_epilogue.
      The __xlated check is based on Eduard's example (Thanks!) in v1.
      
      args->a is returned by the struct_ops prog (either the main prog
      or the epilogue). Thus, the __retval of the SEC("syscall") prog
      is checked. For example, when triggering the ops in
      'SEC("struct_ops/test_epilogue") int test_epilogue',
      the expected args->a is 1 (subprog call) + 10 (kfunc call)
      + 10000 (.gen_epilogue) = 10011.
      The expected return value is 2 * 10011 (.gen_epilogue).
      Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-7-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: attach struct_ops maps before test prog runs · a0dbf6d0
      Eduard Zingerman authored
      In test_loader based tests, call bpf_map__attach_struct_ops()
      before the call to bpf_prog_test_run_opts() in order to trigger
      the bpf_struct_ops->reg() callbacks on the kernel side.
      This allows using the __retval macro for struct_ops tests.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-6-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Export bpf_base_func_proto · 866d571e
      Martin KaFai Lau authored
      The bpf_testmod needs to use the bpf_tail_call helper in
      a later selftest patch. This patch exports
      bpf_base_func_proto with EXPORT_SYMBOL_GPL.
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-5-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add gen_epilogue to bpf_verifier_ops · 169c3176
      Martin KaFai Lau authored
      This patch adds a .gen_epilogue to the bpf_verifier_ops. It is similar
      to the existing .gen_prologue. Instead of allowing a subsystem
      to run code at the beginning of a bpf prog, it allows the subsystem
      to run code just before the bpf prog exit.
      
      One use case is to allow the upcoming bpf qdisc to ensure that
      the skb->dev is the same as the qdisc->dev_queue->dev. The bpf qdisc
      struct_ops implementation could either fix it up or drop the skb.
      Another use case could be in bpf_tcp_ca.c, to enforce that snd_cwnd
      has a sane value (e.g. non-zero).

      The epilogue can only do useful work (like checking skb->dev) if it
      can access the bpf prog's ctx. Unlike in the prologue, r1 may not hold
      the ctx pointer. This patch saves r1 on the stack if .gen_epilogue
      has returned some instructions in the "epilogue_buf".
      
      The existing .gen_prologue is done in convert_ctx_accesses().
      The new .gen_epilogue is done in the convert_ctx_accesses() also.
      When it sees the (BPF_JMP | BPF_EXIT) instruction, it will be patched
      with the earlier generated "epilogue_buf". The epilogue patching is
      only done for the main prog.
      
      Only one epilogue will be patched into the main program. When the
      bpf prog has multiple BPF_EXIT instructions, a BPF_JA is used
      to goto the earlier patched epilogue. The majority of the archs
      support (BPF_JMP32 | BPF_JA): x86, arm, s390, riscv64, loongarch,
      powerpc and arc. This patch keeps it simple and always
      uses (BPF_JMP32 | BPF_JA). A new macro BPF_JMP32_A is added to
      generate the (BPF_JMP32 | BPF_JA) insn.
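
The "one epilogue, later exits jump back to it" scheme can be illustrated with a toy model in plain C. This is not the verifier's patching code. The `toy_*` names and the instruction encoding are hypothetical, the epilogue is assumed to end with its own exit, and the input program is assumed to contain no other jumps, so no further offset fix-ups are modeled.

```c
#include <assert.h>
#include <stddef.h>

enum toy_op { TOY_OTHER, TOY_EXIT, TOY_EPILOGUE, TOY_JA };

struct toy_insn {
    enum toy_op op;
    int off;    /* only meaningful for TOY_JA: target = index + 1 + off */
};

/*
 * Copy `in` (n insns) into `out`. The first TOY_EXIT is expanded into an
 * epilogue of `epi_len` insns, the last of which is the real exit. Every
 * later TOY_EXIT becomes a jump to the start of that epilogue.
 * `out` must have room for n + epi_len - 1 entries.
 * Returns the length of the patched program.
 */
static size_t toy_patch_epilogue(const struct toy_insn *in, size_t n,
                                 struct toy_insn *out, size_t epi_len)
{
    size_t o = 0, epi_start = 0;
    int patched = 0;

    assert(epi_len >= 1);

    for (size_t i = 0; i < n; i++) {
        if (in[i].op != TOY_EXIT) {
            out[o++] = in[i];
        } else if (!patched) {
            epi_start = o;
            for (size_t k = 0; k + 1 < epi_len; k++)
                out[o++] = (struct toy_insn){ TOY_EPILOGUE, 0 };
            out[o++] = (struct toy_insn){ TOY_EXIT, 0 };
            patched = 1;
        } else {
            /* target = o + 1 + off, so off = epi_start - o - 1 */
            out[o] = (struct toy_insn){ TOY_JA, (int)epi_start - (int)o - 1 };
            o++;
        }
    }
    return o;
}
```

For a four-instruction input with exits at indices 1 and 3 and a three-instruction epilogue, the model produces six instructions: the epilogue occupies indices 1 to 3 and index 5 holds a backward jump whose target is index 1.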
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-4-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Adjust BPF_JMP that jumps to the 1st insn of the prologue · d5c47719
      Martin KaFai Lau authored
      The next patch will add a ctx ptr saving instruction
      "*(u64 *)(r10 -8) = r1" at the beginning of the main prog
      when there is an epilogue patch (by the .gen_epilogue() verifier
      ops added in the next patch).
      
      There is one corner case if the bpf prog has a BPF_JMP that jumps
      to the 1st instruction. It needs an adjustment such that
      those BPF_JMP instructions won't jump to the newly added
      ctx saving instruction.
      The commit 5337ac4c ("bpf: Fix the corner case with may_goto and jump to the 1st insn.")
      has the details on this case.
      
      Note that the jump back to 1st instruction is not limited to the
      ctx ptr saving instruction. The same also applies to the prologue.
      A later test, pro_epilogue_goto_start.c, has a test for the prologue
      only case.
      
      Thus, this patch does one adjustment after gen_prologue and
      the future ctx ptr saving. It is done by
      adjust_jmp_off(env->prog, 0, delta) where delta has the total
      number of instructions in the prologue and
      the future ctx ptr saving instruction.
      
      The adjust_jmp_off(env->prog, 0, delta) assumes that the
      prologue does not have a goto 1st instruction itself.
      To accommodate a prologue that might itself have a goto 1st insn,
      this patch changes adjust_jmp_off() to skip
      the instructions in [tgt_idx, tgt_idx + delta).
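
The adjustment described above can be modeled in userspace C. This is a simplified sketch rather than the kernel's adjust_jmp_off(): `struct toy_jmp_insn` and `toy_adjust_jmp_off` are hypothetical names, every jump is reduced to a single signed offset, and overflow checking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced instruction: just enough to model jump targets. */
struct toy_jmp_insn {
    bool is_jmp;    /* true if this instruction is a jump */
    int off;        /* jump target is (index + 1 + off) */
};

/*
 * After `delta` instructions were inserted at `tgt_idx`, re-point every
 * jump that still targets `tgt_idx` so that it lands on `tgt_idx + delta`,
 * i.e. on the original instruction. Instructions inside the inserted range
 * [tgt_idx, tgt_idx + delta) are skipped and keep their offsets.
 */
static void toy_adjust_jmp_off(struct toy_jmp_insn *prog, size_t len,
                               size_t tgt_idx, size_t delta)
{
    assert(prog != NULL);

    for (size_t i = 0; i < len; i++) {
        if (i >= tgt_idx && i < tgt_idx + delta)
            continue;   /* part of the inserted prologue */
        if (!prog[i].is_jmp)
            continue;
        if ((long)i + 1 + prog[i].off != (long)tgt_idx)
            continue;   /* jumps somewhere else */
        prog[i].off += (int)delta;
    }
}
```

With a two-instruction prologue at indices 0 and 1 and a jump at index 3 that targets index 0, calling `toy_adjust_jmp_off(prog, 4, 0, 2)` moves that jump's target to index 2, while a jump inside the prologue that targets index 0 is left alone.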
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-3-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Move insn_buf[16] to bpf_verifier_env · 6f606ffd
      Martin KaFai Lau authored
      This patch moves the 'struct bpf_insn insn_buf[16]' stack usage
      to the bpf_verifier_env. A '#define INSN_BUF_SIZE 16' is also added
      to replace the ARRAY_SIZE(insn_buf) usages.
      
      Both convert_ctx_accesses() and do_misc_fixup() are changed
      to use the env->insn_buf.
      
      This is refactoring work in preparation for adding the epilogue_buf[16] in a later patch.
      
      With this patch, the stack size usage decreased.
      
      Before:
      ./kernel/bpf/verifier.c:22133:5: warning: stack frame size (2584)
      
      After:
      ./kernel/bpf/verifier.c:22184:5: warning: stack frame size (2264)
      Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240829210833.388152-2-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 29 Aug, 2024 6 commits
  3. 28 Aug, 2024 4 commits
    • selftests/bpf: Fix incorrect parameters in NULL pointer checking · c264487e
      Hao Ge authored
      Smatch reported the following warning:
          ./tools/testing/selftests/bpf/testing_helpers.c:455 get_xlated_program()
          warn: variable dereferenced before check 'buf' (see line 454)
      
      The warning seems correct, so let's fix the code based on its suggestion.

      In fact, commit b23ed4d7 ("selftests/bpf: Fix invalid pointer
      check in get_xlated_program()") once fixed this issue in
      test_verifier.c, but the fix was reverted this time.
      
      Let's solve this issue with the minimal changes possible.
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/all/1eb3732f-605a-479d-ba64-cd14250cbf91@stanley.mountain/
      Fixes: b4b7a409 ("selftests/bpf: Factor out get_xlated_program() helper")
      Signed-off-by: Hao Ge <gehao@kylinos.cn>
      Link: https://lore.kernel.org/r/20240820023622.29190-1-hao.ge@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf-arm64-simplify-jited-prologue-epilogue' · 4961d8f4
      Alexei Starovoitov authored
      Xu Kuohai says:
      
      ====================
      bpf, arm64: Simplify jited prologue/epilogue
      
      From: Xu Kuohai <xukuohai@huawei.com>
      
      The arm64 jit blindly saves/restores all callee-saved registers, making
      the jited result look a bit too complicated. For example, for an empty
      prog, the jited result is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     x19, x20, [sp, #-16]!
        1c:   stp     x21, x22, [sp, #-16]!
        20:   stp     x26, x25, [sp, #-16]!
        24:   mov     x26, #0
        28:   stp     x26, x25, [sp, #-16]!
        2c:   mov     x26, sp
        30:   stp     x27, x28, [sp, #-16]!
        34:   mov     x25, sp
        38:   bti j 		// tailcall target
        3c:   sub     sp, sp, #0
        40:   mov     x7, #0
        44:   add     sp, sp, #0
        48:   ldp     x27, x28, [sp], #16
        4c:   ldp     x26, x25, [sp], #16
        50:   ldp     x26, x25, [sp], #16
        54:   ldp     x21, x22, [sp], #16
        58:   ldp     x19, x20, [sp], #16
        5c:   ldp     fp, lr, [sp], #16
        60:   mov     x0, x7
        64:   autiasp
        68:   ret
      
      Clearly, there is no need to save/restore unused callee-saved registers.
      This patch makes that change, so that the jited image only saves/restores
      the callee-saved registers it uses.
      
      Now the jited result of empty prog is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     xzr, x26, [sp, #-16]!
        1c:   mov     x26, sp
        20:   bti j		// tailcall target
        24:   mov     x7, #0
        28:   ldp     xzr, x26, [sp], #16
        2c:   ldp     fp, lr, [sp], #16
        30:   mov     x0, x7
        34:   autiasp
        38:   ret
      ====================
      Acked-by: Puranjay Mohan <puranjay@kernel.org>
      Link: https://lore.kernel.org/r/20240826071624.350108-1-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, arm64: Avoid blindly saving/restoring all callee-saved registers · 5d4fa9ec
      Xu Kuohai authored
      The arm64 jit blindly saves/restores all callee-saved registers, making
      the jited result look a bit too complicated. For example, for an empty
      prog, the jited result is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     x19, x20, [sp, #-16]!
        1c:   stp     x21, x22, [sp, #-16]!
        20:   stp     x26, x25, [sp, #-16]!
        24:   mov     x26, #0
        28:   stp     x26, x25, [sp, #-16]!
        2c:   mov     x26, sp
        30:   stp     x27, x28, [sp, #-16]!
        34:   mov     x25, sp
        38:   bti j 		// tailcall target
        3c:   sub     sp, sp, #0
        40:   mov     x7, #0
        44:   add     sp, sp, #0
        48:   ldp     x27, x28, [sp], #16
        4c:   ldp     x26, x25, [sp], #16
        50:   ldp     x26, x25, [sp], #16
        54:   ldp     x21, x22, [sp], #16
        58:   ldp     x19, x20, [sp], #16
        5c:   ldp     fp, lr, [sp], #16
        60:   mov     x0, x7
        64:   autiasp
        68:   ret
      
      Clearly, there is no need to save/restore unused callee-saved registers.
      This patch makes that change, so that the jited image only saves/restores
      the callee-saved registers it uses.
      
      Now the jited result of empty prog is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     xzr, x26, [sp, #-16]!
        1c:   mov     x26, sp
        20:   bti j		// tailcall target
        24:   mov     x7, #0
        28:   ldp     xzr, x26, [sp], #16
        2c:   ldp     fp, lr, [sp], #16
        30:   mov     x0, x7
        34:   autiasp
        38:   ret
      
      Since a bpf prog saves/restores its own callee-saved registers as needed,
      to make tailcall work correctly, the caller needs to restore its saved
      registers before the tailcall, and the callee needs to save its
      callee-saved registers after the tailcall. These extra restoring/saving
      instructions increase performance overhead.
      
      [1] provides 2 benchmarks for tailcall scenarios. Below is the perf
      number measured in an arm64 KVM guest. The result indicates that the
      performance difference before and after the patch in typical tailcall
      scenarios is negligible.
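
One way to read the before/after perf counters is as a relative change per counter. The helper below is a minimal sketch with a hypothetical name, `pct_change`; it is not part of the patch or of perf.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Negative means the counter went down after the patch. */
static double pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

Applied to the './test_progs -t tailcalls' runs in this message, cycles go from 10697772784 to 10588986432 (about -1.0%) and instructions from 25511241955 to 25303825043 (about -0.8%), which is in line with the difference being described as negligible.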
      
      - Before:
      
       Performance counter stats for './test_progs -t tailcalls' (5 runs):
      
                 4313.43 msec task-clock                       #    0.874 CPUs utilized               ( +-  0.16% )
                     574      context-switches                 #  133.073 /sec                        ( +-  1.14% )
                       0      cpu-migrations                   #    0.000 /sec
                     538      page-faults                      #  124.727 /sec                        ( +-  0.57% )
             10697772784      cycles                           #    2.480 GHz                         ( +-  0.22% )  (61.19%)
             25511241955      instructions                     #    2.38  insn per cycle              ( +-  0.08% )  (66.70%)
              5108910557      branches                         #    1.184 G/sec                       ( +-  0.08% )  (72.38%)
                 2800459      branch-misses                    #    0.05% of all branches             ( +-  0.51% )  (72.36%)
                              TopDownL1                 #     0.60 retiring                    ( +-  0.09% )  (66.84%)
                                                        #     0.21 frontend_bound              ( +-  0.15% )  (61.31%)
                                                        #     0.12 bad_speculation             ( +-  0.08% )  (50.11%)
                                                        #     0.07 backend_bound               ( +-  0.16% )  (33.30%)
              8274201819      L1-dcache-loads                  #    1.918 G/sec                       ( +-  0.18% )  (33.15%)
                  468268      L1-dcache-load-misses            #    0.01% of all L1-dcache accesses   ( +-  4.69% )  (33.16%)
                  385383      LLC-loads                        #   89.345 K/sec                       ( +-  5.22% )  (33.16%)
                   38296      LLC-load-misses                  #    9.94% of all LL-cache accesses    ( +- 42.52% )  (38.69%)
              6886576501      L1-icache-loads                  #    1.597 G/sec                       ( +-  0.35% )  (38.69%)
                 1848585      L1-icache-load-misses            #    0.03% of all L1-icache accesses   ( +-  4.52% )  (44.23%)
              9043645883      dTLB-loads                       #    2.097 G/sec                       ( +-  0.10% )  (44.33%)
                  416672      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  5.15% )  (49.89%)
              6925626111      iTLB-loads                       #    1.606 G/sec                       ( +-  0.35% )  (55.46%)
                   66220      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  1.88% )  (55.50%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                  4.9372 +- 0.0526 seconds time elapsed  ( +-  1.07% )
      
       Performance counter stats for './test_progs -t flow_dissector' (5 runs):
      
                10924.50 msec task-clock                       #    0.945 CPUs utilized               ( +-  0.08% )
                     603      context-switches                 #   55.197 /sec                        ( +-  1.13% )
                       0      cpu-migrations                   #    0.000 /sec
                     566      page-faults                      #   51.810 /sec                        ( +-  0.42% )
             27381270695      cycles                           #    2.506 GHz                         ( +-  0.18% )  (60.46%)
             56996583922      instructions                     #    2.08  insn per cycle              ( +-  0.21% )  (66.11%)
             10321647567      branches                         #  944.816 M/sec                       ( +-  0.17% )  (71.79%)
                 3347735      branch-misses                    #    0.03% of all branches             ( +-  3.72% )  (72.15%)
                              TopDownL1                 #     0.52 retiring                    ( +-  0.13% )  (66.74%)
                                                        #     0.27 frontend_bound              ( +-  0.14% )  (61.27%)
                                                        #     0.14 bad_speculation             ( +-  0.19% )  (50.36%)
                                                        #     0.07 backend_bound               ( +-  0.42% )  (33.89%)
             18740797617      L1-dcache-loads                  #    1.715 G/sec                       ( +-  0.43% )  (33.71%)
                13715669      L1-dcache-load-misses            #    0.07% of all L1-dcache accesses   ( +- 32.85% )  (33.34%)
                 4087551      LLC-loads                        #  374.164 K/sec                       ( +- 29.53% )  (33.26%)
                  267906      LLC-load-misses                  #    6.55% of all LL-cache accesses    ( +- 23.90% )  (38.76%)
             15811864229      L1-icache-loads                  #    1.447 G/sec                       ( +-  0.12% )  (38.73%)
                 2976833      L1-icache-load-misses            #    0.02% of all L1-icache accesses   ( +-  9.73% )  (44.22%)
             20138907471      dTLB-loads                       #    1.843 G/sec                       ( +-  0.18% )  (44.15%)
                  732850      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +- 11.18% )  (49.64%)
             15895726702      iTLB-loads                       #    1.455 G/sec                       ( +-  0.15% )  (55.13%)
                  152075      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  4.71% )  (54.98%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                 11.5613 +- 0.0317 seconds time elapsed  ( +-  0.27% )
      
      - After:
      
       Performance counter stats for './test_progs -t tailcalls' (5 runs):
      
                 4278.78 msec task-clock                       #    0.871 CPUs utilized               ( +-  0.15% )
                     569      context-switches                 #  132.982 /sec                        ( +-  0.58% )
                       0      cpu-migrations                   #    0.000 /sec
                     539      page-faults                      #  125.970 /sec                        ( +-  0.43% )
             10588986432      cycles                           #    2.475 GHz                         ( +-  0.20% )  (60.91%)
             25303825043      instructions                     #    2.39  insn per cycle              ( +-  0.08% )  (66.48%)
              5110756256      branches                         #    1.194 G/sec                       ( +-  0.07% )  (72.03%)
                 2719569      branch-misses                    #    0.05% of all branches             ( +-  2.42% )  (72.03%)
                              TopDownL1                 #     0.60 retiring                    ( +-  0.22% )  (66.31%)
                                                        #     0.22 frontend_bound              ( +-  0.21% )  (60.83%)
                                                        #     0.12 bad_speculation             ( +-  0.26% )  (50.25%)
                                                        #     0.06 backend_bound               ( +-  0.17% )  (33.52%)
              8163648527      L1-dcache-loads                  #    1.908 G/sec                       ( +-  0.33% )  (33.52%)
                  694979      L1-dcache-load-misses            #    0.01% of all L1-dcache accesses   ( +- 30.53% )  (33.52%)
                 1902347      LLC-loads                        #  444.600 K/sec                       ( +- 48.84% )  (33.69%)
                   96677      LLC-load-misses                  #    5.08% of all LL-cache accesses    ( +- 43.48% )  (39.30%)
              6863517589      L1-icache-loads                  #    1.604 G/sec                       ( +-  0.37% )  (39.17%)
                 1871519      L1-icache-load-misses            #    0.03% of all L1-icache accesses   ( +-  6.78% )  (44.56%)
              8927782813      dTLB-loads                       #    2.087 G/sec                       ( +-  0.14% )  (44.37%)
                  438237      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  6.00% )  (49.75%)
              6886906831      iTLB-loads                       #    1.610 G/sec                       ( +-  0.36% )  (55.08%)
                   67568      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  3.27% )  (54.86%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                  4.9114 +- 0.0309 seconds time elapsed  ( +-  0.63% )
      
       Performance counter stats for './test_progs -t flow_dissector' (5 runs):
      
                10948.40 msec task-clock                       #    0.942 CPUs utilized               ( +-  0.05% )
                     615      context-switches                 #   56.173 /sec                        ( +-  1.65% )
                       1      cpu-migrations                   #    0.091 /sec                        ( +- 31.62% )
                     567      page-faults                      #   51.788 /sec                        ( +-  0.44% )
             27334194328      cycles                           #    2.497 GHz                         ( +-  0.08% )  (61.05%)
             56656528828      instructions                     #    2.07  insn per cycle              ( +-  0.08% )  (66.67%)
             10270389422      branches                         #  938.072 M/sec                       ( +-  0.10% )  (72.21%)
                 3453837      branch-misses                    #    0.03% of all branches             ( +-  3.75% )  (72.27%)
                              TopDownL1                 #     0.52 retiring                    ( +-  0.16% )  (66.55%)
                                                        #     0.27 frontend_bound              ( +-  0.09% )  (60.91%)
                                                        #     0.14 bad_speculation             ( +-  0.08% )  (49.85%)
                                                        #     0.07 backend_bound               ( +-  0.16% )  (33.33%)
             18982866028      L1-dcache-loads                  #    1.734 G/sec                       ( +-  0.24% )  (33.34%)
                 8802454      L1-dcache-load-misses            #    0.05% of all L1-dcache accesses   ( +- 52.30% )  (33.31%)
                 2612962      LLC-loads                        #  238.661 K/sec                       ( +- 29.78% )  (33.45%)
                  264107      LLC-load-misses                  #   10.11% of all LL-cache accesses    ( +- 18.34% )  (39.07%)
             15793205997      L1-icache-loads                  #    1.443 G/sec                       ( +-  0.15% )  (39.09%)
                 3930802      L1-icache-load-misses            #    0.02% of all L1-icache accesses   ( +-  3.72% )  (44.66%)
             20097828496      dTLB-loads                       #    1.836 G/sec                       ( +-  0.09% )  (44.68%)
                  961757      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  3.32% )  (50.15%)
             15838728506      iTLB-loads                       #    1.447 G/sec                       ( +-  0.09% )  (55.62%)
                  167652      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  1.28% )  (55.52%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                 11.6173 +- 0.0268 seconds time elapsed  ( +-  0.23% )
      
      [1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5d4fa9ec
    • Xu Kuohai's avatar
      bpf, arm64: Get rid of fpb · bd737fcb
      Xu Kuohai authored
      A bpf prog accesses the stack using BPF_FP as the base address and a
      negative immediate as the offset. But arm64 ldr/str instructions only
      support a non-negative immediate as the offset. To simplify the jited
      result, commit 5b3d19b9 ("bpf, arm64: Adjust the offset of
      str/ldr(immediate) to positive number") introduced FPB to represent the
      lowest stack address that the bpf prog being jited may access. With this
      address as the baseline, it converts BPF_FP plus a negative immediate
      offset to FPB plus a non-negative immediate offset.
      
      For a given bpf prog, the jited stack space is fixed, with A64_SP as
      the lowest address and BPF_FP as the highest address. Thus we can get
      rid of FPB and convert BPF_FP plus a negative immediate offset to
      A64_SP plus a non-negative immediate offset.
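
      The offset arithmetic described above can be sketched as follows. This
      is an illustration only, not the JIT code from the patch: the helper
      names bpf_stack_size() and fp_off_to_sp_off() are hypothetical, and the
      sketch assumes that the stack size is the prog's stack depth rounded up
      to 16 bytes and that BPF_FP equals A64_SP plus that size.

```c
#include <assert.h>

/*
 * Hypothetical helper: size of the jited BPF stack, assumed here to be
 * the prog's stack depth rounded up to a multiple of 16 bytes.
 */
static int bpf_stack_size(int stack_depth)
{
	return (stack_depth + 15) & ~15;
}

/*
 * Hypothetical helper: convert a BPF_FP-relative offset (negative) into
 * an A64_SP-relative offset (non-negative).
 *
 * Assumption: BPF_FP == A64_SP + stack_size, so
 *   BPF_FP + fp_off == A64_SP + (stack_size + fp_off).
 */
static int fp_off_to_sp_off(int stack_size, int fp_off)
{
	/* A valid stack access lies in [-stack_size, 0) relative to BPF_FP. */
	assert(fp_off < 0 && fp_off >= -stack_size);
	return stack_size + fp_off;
}
```

      Under these assumptions, a prog with a 40-byte stack depth gets a
      48-byte stack, and an access at BPF_FP - 8 becomes an access at
      A64_SP + 40. No separate FPB register is needed because the base
      (A64_SP) and the size are both fixed for the whole prog.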
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20240826071624.350108-2-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bd737fcb
  4. 27 Aug, 2024 1 commit
  5. 23 Aug, 2024 15 commits
  6. 22 Aug, 2024 1 commit