      Merge branch 'add-internal-only-bpf-per-cpu-instruction' · 519e1de9
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      Add internal-only BPF per-CPU instruction
      
      Add a new BPF instruction for resolving per-CPU memory addresses.
      
      The new instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_X,
      with insn->off set to BPF_ADDR_PERCPU (== -1). It resolves the
      provided per-CPU offset to an absolute address where the per-CPU data
      resides for the current ("this") CPU.
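
      For illustration, here is a minimal sketch of how this encoding can
      be spelled with the usual struct bpf_insn initializer style (the
      macro name below is illustrative, not necessarily the exact one used
      in the patches):

        /* insn->off == BPF_ADDR_PERCPU marks the special MOV form */
        #define BPF_ADDR_PERCPU (-1)

        /* dst_reg = absolute per-CPU address resolved from src_reg */
        #define BPF_MOV64_PERCPU_REG(DST, SRC)                  \
                ((struct bpf_insn) {                            \
                        .code    = BPF_ALU64 | BPF_MOV | BPF_X, \
                        .dst_reg = DST,                         \
                        .src_reg = SRC,                         \
                        .off     = BPF_ADDR_PERCPU,             \
                        .imm     = 0 })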
      
      This patch set implements support for it in the x86-64 BPF JIT only.
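
      Conceptually, the JIT can lower this to a register move plus a single
      %gs-relative add of this_cpu_off (x86-64's per-CPU base offset). As a
      C-level sketch of the value being computed (not the literal emitted
      machine code):

        /* The emitted code is roughly:
         *     mov  dst, src
         *     add  dst, %gs:[this_cpu_off]
         * which computes:
         */
        static inline unsigned long percpu_resolve(unsigned long percpu_off)
        {
                return percpu_off + __this_cpu_read(this_cpu_off);
        }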
      
      Using the new instruction, we also implement inlining for three cases:
        - bpf_get_smp_processor_id(), which avoids an unnecessary trivial
          function call, saving a bit of performance and also not polluting
          LBR records with unnecessary function call/return records (see
          the instruction sketch after this list);
        - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined,
          bringing its performance on par with per-CPU data structures
          implemented using global variables in BPF (which is an awesome
          improvement, see benchmarks below);
        - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just
          like for the non-PERCPU HASH map; this still saves a bit of
          overhead.
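
      As a sketch of the first case (illustrative, assuming x86-64's
      per-CPU cpu_number field; not a verbatim quote of the patch), the
      verifier can rewrite the bpf_get_smp_processor_id() call into three
      instructions:

        /* r0 = per-CPU offset of the CPU-number field */
        insn_buf[0] = BPF_MOV32_IMM(BPF_REG_0,
                                    (u32)(unsigned long)&pcpu_hot.cpu_number);
        /* r0 = absolute address of that field for "this" CPU */
        insn_buf[1] = BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0);
        /* r0 = *(u32 *)r0, i.e. the current CPU number */
        insn_buf[2] = BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0);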
      
      To validate the performance benefits, I hacked together a tiny
      benchmark doing only bpf_map_lookup_elem() and incrementing the value
      by 1, for PERCPU_ARRAY (arr-inc benchmark below) and PERCPU_HASH
      (hash-inc benchmark below) maps. To establish a baseline, I also
      implemented similar logic based on a global variable array, using
      bpf_get_smp_processor_id() to index the array for the current CPU
      (glob-arr-inc benchmark below).
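
      A minimal sketch of the per-invocation logic of the arr-inc benchmark
      (map, section, and program names here are hypothetical; the actual
      benchmark is a separate tiny tool):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
                __uint(max_entries, 1);
                __type(key, __u32);
                __type(value, __u64);
        } percpu_arr SEC(".maps");

        SEC("raw_tp/sys_enter")
        int arr_inc(void *ctx)
        {
                __u32 key = 0;
                __u64 *val = bpf_map_lookup_elem(&percpu_arr, &key);

                if (val)
                        (*val)++;       /* per-CPU slot, no atomics needed */
                return 0;
        }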
      
      BEFORE
      ======
      glob-arr-inc   :  163.685 ± 0.092M/s
      arr-inc        :  138.096 ± 0.160M/s
      hash-inc       :   66.855 ± 0.123M/s
      
      AFTER
      =====
      glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
      arr-inc        :  170.729 ± 0.210M/s (+23.7%)
      hash-inc       :   68.673 ± 0.070M/s (+2.7%)
      
      As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while the
      global array-based variant gets a nice +6% due to the inlining of
      bpf_get_smp_processor_id().

      But what's really important is that the arr-inc benchmark basically
      catches up with glob-arr-inc, resulting in a +23.7% improvement. This
      means that in practice it won't be necessary to avoid PERCPU_ARRAY
      anymore if performance is critical (e.g., high-frequency stats
      collection, which is often a practical use for PERCPU_ARRAY today).
      
      v1->v2:
        - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
        - dropped the direct per-CPU memory read instruction; it can always
          be added back, if necessary;
        - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
        - switched all per-cpu addr casts to (unsigned long) to avoid sparse
          warnings.
      ====================
      
      Link: https://lore.kernel.org/r/20240402021307.1012571-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>