• Andrii Nakryiko's avatar
    selftests/bpf: scale benchmark counting by using per-CPU counters · 520fad2e
    Andrii Nakryiko authored
    When benchmarking with multiple threads (-pN, where N>1), we start
    contending on single atomic counter that both BPF trigger benchmarks are
    using, as well as "baseline" tests in user space (trig-base and
    trig-uprobe-base benchmarks). As such, we start bottlenecking on
    something completely irrelevant to benchmark at hand.
    
    Scale counting up by using per-CPU counters on BPF side. On use space
    side we do the next best thing: hash thread ID to approximate per-CPU
    behavior. It seems to work quite well in practice.
    
    To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8,
    16, and 32 threads:
      - trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
      - trig-base (get_pgid() syscall, atomic counter in user-space);
      - trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU
        counter on BPF side).
    
    Command used:
    
      for b in uprobe-base base fentry; do \
        for p in 1 2 4 8 16 32; do \
          printf "%-11s %2d: %s\n" $b $p \
            "$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
        done; \
      done
    
    Before these changes, aggregate throughput across all threads doesn't
    scale well with number of threads, it actually even falls sharply for
    uprobe-base due to a very high contention:
    
      uprobe-base  1:  138.998 ± 0.650M/s
      uprobe-base  2:   70.526 ± 1.147M/s
      uprobe-base  4:   63.114 ± 0.302M/s
      uprobe-base  8:   54.177 ± 0.138M/s
      uprobe-base 16:   45.439 ± 0.057M/s
      uprobe-base 32:   37.163 ± 0.242M/s
      base         1:   16.940 ± 0.182M/s
      base         2:   19.231 ± 0.105M/s
      base         4:   21.479 ± 0.038M/s
      base         8:   23.030 ± 0.037M/s
      base        16:   22.034 ± 0.004M/s
      base        32:   18.152 ± 0.013M/s
      fentry       1:   14.794 ± 0.054M/s
      fentry       2:   17.341 ± 0.055M/s
      fentry       4:   23.792 ± 0.024M/s
      fentry       8:   21.557 ± 0.047M/s
      fentry      16:   21.121 ± 0.004M/s
      fentry      32:   17.067 ± 0.023M/s
    
    After these changes, we see almost perfect linear scaling, as expected.
    The sub-linear scaling when going from 8 to 16 threads is interesting
    and consistent on my test machine, but I haven't investigated what is
    causing it this peculiar slowdown (across all benchmarks, could be due
    to hyperthreading effects, not sure).
    
      uprobe-base  1:  139.980 ± 0.648M/s
      uprobe-base  2:  270.244 ± 0.379M/s
      uprobe-base  4:  532.044 ± 1.519M/s
      uprobe-base  8: 1004.571 ± 3.174M/s
      uprobe-base 16: 1720.098 ± 0.744M/s
      uprobe-base 32: 3506.659 ± 8.549M/s
      base         1:   16.869 ± 0.071M/s
      base         2:   33.007 ± 0.092M/s
      base         4:   64.670 ± 0.203M/s
      base         8:  121.969 ± 0.210M/s
      base        16:  207.832 ± 0.112M/s
      base        32:  424.227 ± 1.477M/s
      fentry       1:   14.777 ± 0.087M/s
      fentry       2:   28.575 ± 0.146M/s
      fentry       4:   56.234 ± 0.176M/s
      fentry       8:  106.095 ± 0.385M/s
      fentry      16:  181.440 ± 0.032M/s
      fentry      32:  369.131 ± 0.693M/s
    Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
    Message-ID: <20240315213329.1161589-1-andrii@kernel.org>
    Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
    520fad2e
trigger_bench.c 1.83 KB