Commit 8e7c2a02 authored by Andrii Nakryiko's avatar Andrii Nakryiko Committed by Alexei Starovoitov

selftests/bpf: Add benchmark runner infrastructure

While working on BPF ringbuf implementation, testing, and benchmarking, I've
developed a pretty generic and modular benchmark runner, which seems to be
generically useful, as I've already used it for one more purpose (testing
fastest way to trigger BPF program, to minimize overhead of in-kernel code).

This patch adds generic part of benchmark runner and sets up Makefile for
extending it with more sets of benchmarks.

Benchmarker itself operates by spinning up specified number of producer and
consumer threads, setting up interval timer sending SIGALARM signal to
application once a second. Every second, current snapshot with hits/drops
counters are collected and stored in an array. Drops are useful for
producer/consumer benchmarks in which producer might overwhelm consumers.

Once test finishes after given amount of warm-up and testing seconds, mean and
stddev are calculated (ignoring warm-up results) and is printed out to stdout.
This setup seems to give consistent and accurate results.

To validate behavior, I added two atomic counting tests: global and local.
For global one, all the producer threads are atomically incrementing same
counter as fast as possible. This, of course, leads to huge drop of
performance once there is more than one producer thread due to CPUs fighting
for the same memory location.

Local counting, on the other hand, maintains one counter per each producer
thread, incremented independently. Once per second, all counters are read and
added together to form final "counting throughput" measurement. As expected,
such setup demonstrates linear scalability with number of producers (as long
as there are enough physical CPU cores, of course). See example output below.
Also, this setup can nicely demonstrate disastrous effects of false sharing,
if care is not taken to take those per-producer counters apart into
independent cache lines.

Demo output shows global counter first with 1 producer, then with 4. Both
total and per-producer performance significantly drop. The last run is local
counter with 4 producers, demonstrating near-perfect scalability.

$ ./bench -a -w1 -d2 -p1 count-global
Setting up benchmark 'count-global'...
Benchmark 'count-global' started.
Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s

$ ./bench -a -w1 -d2 -p4 count-global
Setting up benchmark 'count-global'...
Benchmark 'count-global' started.
Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s

$ ./bench -a -w1 -d2 -p4 count-local
Setting up benchmark 'count-local'...
Benchmark 'count-local' started.
Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s

Benchmark runner supports setting thread affinity for producer and consumer
threads. You can use -a flag for default CPU selection scheme, where first
consumer gets CPU #0, next one gets CPU #1, and so on. Then producer threads
pick up next CPU and increment one-by-one as well. But user can also specify
a set of CPUs independently for producers and consumers with --prod-affinity
1,2-10,15 and --cons-affinity <set-of-cpus>. The latter allows to force
producers and consumers to share same set of CPUs, if necessary.
Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
Acked-by: default avatarYonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200512192445.2351848-3-andriin@fb.com
parent cd49291c
......@@ -38,3 +38,4 @@ test_cpp
/bpf_gcc
/tools
/runqslower
/bench
......@@ -77,7 +77,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
# Compile but not part of 'make run_tests'
TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
test_lirc_mode2_user xdping test_cpp runqslower
test_lirc_mode2_user xdping test_cpp runqslower bench
TEST_CUSTOM_PROGS = urandom_read
......@@ -407,6 +407,17 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
$(call msg,CXX,,$@)
$(CXX) $(CFLAGS) $^ $(LDLIBS) -o $@
# Benchmark runner
$(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
$(call msg,CC,,$@)
$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
$(OUTPUT)/bench.o: bench.h testing_helpers.h
$(OUTPUT)/bench: LDLIBS += -lm
$(OUTPUT)/bench: $(OUTPUT)/bench.o $(OUTPUT)/testing_helpers.o \
$(OUTPUT)/bench_count.o
$(call msg,BINARY,,$@)
$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR) \
prog_tests/tests.h map_tests/tests.h verifier/tests.h \
feature \
......
This diff is collapsed.
/* SPDX-License-Identifier: GPL-2.0 */
#pragma once
#include <stdlib.h>
#include <stdbool.h>
#include <linux/err.h>
#include <errno.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <math.h>
#include <time.h>
#include <sys/syscall.h>
struct cpu_set {
bool *cpus;
int cpus_len;
int next_cpu;
};
struct env {
char *bench_name;
int duration_sec;
int warmup_sec;
bool verbose;
bool list;
bool affinity;
int consumer_cnt;
int producer_cnt;
struct cpu_set prod_cpus;
struct cpu_set cons_cpus;
};
struct bench_res {
long hits;
long drops;
};
struct bench {
const char *name;
void (*validate)();
void (*setup)();
void *(*producer_thread)(void *ctx);
void *(*consumer_thread)(void *ctx);
void (*measure)(struct bench_res* res);
void (*report_progress)(int iter, struct bench_res* res, long delta_ns);
void (*report_final)(struct bench_res res[], int res_cnt);
};
struct counter {
long value;
} __attribute__((aligned(128)));
extern struct env env;
extern const struct bench *bench;
void setup_libbpf();
void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns);
void hits_drops_report_final(struct bench_res res[], int res_cnt);
static inline __u64 get_time_ns() {
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
return (u64)t.tv_sec * 1000000000 + t.tv_nsec;
}
static inline void atomic_inc(long *value)
{
(void)__atomic_add_fetch(value, 1, __ATOMIC_RELAXED);
}
static inline void atomic_add(long *value, long n)
{
(void)__atomic_add_fetch(value, n, __ATOMIC_RELAXED);
}
static inline long atomic_swap(long *value, long n)
{
return __atomic_exchange_n(value, n, __ATOMIC_RELAXED);
}
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2020 Facebook */
#include "bench.h"
/* COUNT-GLOBAL benchmark */
static struct count_global_ctx {
struct counter hits;
} count_global_ctx;
static void *count_global_producer(void *input)
{
struct count_global_ctx *ctx = &count_global_ctx;
while (true) {
atomic_inc(&ctx->hits.value);
}
return NULL;
}
static void *count_global_consumer(void *input)
{
return NULL;
}
static void count_global_measure(struct bench_res *res)
{
struct count_global_ctx *ctx = &count_global_ctx;
res->hits = atomic_swap(&ctx->hits.value, 0);
}
/* COUNT-local benchmark */
static struct count_local_ctx {
struct counter *hits;
} count_local_ctx;
static void count_local_setup()
{
struct count_local_ctx *ctx = &count_local_ctx;
ctx->hits = calloc(env.consumer_cnt, sizeof(*ctx->hits));
if (!ctx->hits)
exit(1);
}
static void *count_local_producer(void *input)
{
struct count_local_ctx *ctx = &count_local_ctx;
int idx = (long)input;
while (true) {
atomic_inc(&ctx->hits[idx].value);
}
return NULL;
}
static void *count_local_consumer(void *input)
{
return NULL;
}
static void count_local_measure(struct bench_res *res)
{
struct count_local_ctx *ctx = &count_local_ctx;
int i;
for (i = 0; i < env.producer_cnt; i++) {
res->hits += atomic_swap(&ctx->hits[i].value, 0);
}
}
const struct bench bench_count_global = {
.name = "count-global",
.producer_thread = count_global_producer,
.consumer_thread = count_global_consumer,
.measure = count_global_measure,
.report_progress = hits_drops_report_progress,
.report_final = hits_drops_report_final,
};
const struct bench bench_count_local = {
.name = "count-local",
.setup = count_local_setup,
.producer_thread = count_local_producer,
.consumer_thread = count_local_consumer,
.measure = count_local_measure,
.report_progress = hits_drops_report_progress,
.report_final = hits_drops_report_final,
};
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment