1. 08 Mar, 2021 3 commits
    • libbpf, xsk: Add libbpf_smp_store_release libbpf_smp_load_acquire · 291471dd
      Björn Töpel authored
      Now that the AF_XDP rings have load-acquire/store-release semantics,
      move libbpf to that as well.
      
      The library-internal libbpf_smp_{load_acquire,store_release} are only
      valid for 32-bit words on ARM64.
      
      Also, remove the barriers that are no longer in use.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210305094113.413544-3-bjorn.topel@gmail.com
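      A minimal sketch of what load-acquire/store-release accessors for a
      32-bit ring index can look like in portable C11. The helper names
      ring_load_acquire and ring_store_release are hypothetical and chosen
      for illustration; this is not the libbpf-internal implementation,
      which is architecture-specific.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helpers (assumed names, not the libbpf macros). */

/* Read a 32-bit index; later loads/stores cannot move before this load. */
static inline uint32_t ring_load_acquire(_Atomic uint32_t *p)
{
	return atomic_load_explicit(p, memory_order_acquire);
}

/* Publish a 32-bit index; earlier loads/stores cannot move after this store. */
static inline void ring_store_release(_Atomic uint32_t *p, uint32_t v)
{
	atomic_store_explicit(p, v, memory_order_release);
}
```

      A producer would fill in its ring entries and then call
      ring_store_release() on the producer index; a consumer would call
      ring_load_acquire() on that index before reading the entries.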
    • xsk: Update rings for load-acquire/store-release barriers · a23b3f56
      Björn Töpel authored
      Currently, the AF_XDP rings use general smp_{r,w,}mb() barriers on
      the kernel-side. On most modern architectures
      load-acquire/store-release barriers perform better, and result in
      simpler code for circular ring buffers.
      
      This change updates the XDP socket rings to use
      load-acquire/store-release barriers.
      
      It is important to note that changing from the old smp_{r,w,}mb()
      barriers, to load-acquire/store-release barriers does not break
      compatibility. The old semantics work with the new one, and vice
      versa.
      
      As pointed out by "Documentation/memory-barriers.txt" in the "SMP
      BARRIER PAIRING" section:
      
        "General barriers pair with each other, though they also pair with
        most other types of barriers, albeit without multicopy atomicity.
        An acquire barrier pairs with a release barrier, but both may also
        pair with other barriers, including of course general barriers."
      
      How different barriers behave and pair is outlined in
      "tools/memory-model/Documentation/cheatsheet.txt".
      
      In order to make sure that compatibility is not broken, LKMM herd7
      based litmus tests can be constructed and verified.
      
      We generalize the XDP socket ring to a one entry ring, and create two
      scenarios: one where the ring is full, where only the consumer can
      proceed, followed by the producer; and one where the ring is empty,
      where only the producer can proceed, followed by the consumer. Each
      scenario is then expanded to four different tests: general
      producer/general consumer, general producer/acqrel consumer, acqrel
      producer/general consumer, acqrel producer/acqrel consumer. In total,
      eight tests.
      
      The empty ring test:
        C spsc-rb+empty
      
        // Simple one entry ring:
        // prod cons     allowed action       prod cons
        //    0    0 =>       prod          =>   1    0
        //    0    1 =>       cons          =>   0    0
        //    1    0 =>       cons          =>   1    1
        //    1    1 =>       prod          =>   0    1
      
        {}
      
        // We start at prod==0, cons==0, data==0, i.e. nothing has been
        // written to the ring. From here only the producer can start, and
        // should write 1. Afterwards, consumer can continue and read 1 to
        // data. Can we enter state prod==1, cons==1, but consumer observed
        // the incorrect value of 0?
      
        P0(int *prod, int *cons, int *data)
        {
           ... producer
        }
      
        P1(int *prod, int *cons, int *data)
        {
           ... consumer
        }
      
        exists( 1:d=0 /\ prod=1 /\ cons=1 );
      
      The full ring test:
        C spsc-rb+full
      
        // Simple one entry ring:
        // prod cons     allowed action       prod cons
        //    0    0 =>       prod          =>   1    0
        //    0    1 =>       cons          =>   0    0
        //    1    0 =>       cons          =>   1    1
        //    1    1 =>       prod          =>   0    1
      
        { prod = 1; }
      
        // We start at prod==1, cons==0, data==0, i.e. producer has
        // written 0, so from here only the consumer can start, and should
        // consume 0. Afterwards, producer can continue and write 1 to
        // data. Can we enter state prod==0, cons==1, but consumer observed
        // the write of 1?
      
        P0(int *prod, int *cons, int *data)
        {
          ... producer
        }
      
        P1(int *prod, int *cons, int *data)
        {
          ... consumer
        }
      
        exists( 1:d=1 /\ prod=0 /\ cons=1 );
      
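      where P0 and P1 are as follows, each given first in its general
      barrier variant and then in its acquire/release variant: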
      
        P0(int *prod, int *cons, int *data)
        {
        	int p;
      
        	p = READ_ONCE(*prod);
        	if (READ_ONCE(*cons) == p) {
        		WRITE_ONCE(*data, 1);
        		smp_wmb();
        		WRITE_ONCE(*prod, p ^ 1);
        	}
        }
      
        P0(int *prod, int *cons, int *data)
        {
        	int p;
      
        	p = READ_ONCE(*prod);
        	if (READ_ONCE(*cons) == p) {
        		WRITE_ONCE(*data, 1);
        		smp_store_release(prod, p ^ 1);
        	}
        }
      
        P1(int *prod, int *cons, int *data)
        {
        	int c;
        	int d = -1;
      
        	c = READ_ONCE(*cons);
        	if (READ_ONCE(*prod) != c) {
        		smp_rmb();
        		d = READ_ONCE(*data);
        		smp_mb();
        		WRITE_ONCE(*cons, c ^ 1);
        	}
        }
      
        P1(int *prod, int *cons, int *data)
        {
        	int c;
        	int d = -1;
      
        	c = READ_ONCE(*cons);
        	if (smp_load_acquire(prod) != c) {
        		d = READ_ONCE(*data);
        		smp_store_release(cons, c ^ 1);
        	}
        }
      
      The full LKMM litmus tests are found at [1].
      
      On x86-64 systems the l2fwd AF_XDP xdpsock sample performance
      increases by 1%. This is mostly because the smp_mb() is removed,
      which is a relatively expensive operation on these
      platforms. Weakly-ordered platforms, such as ARM64, might benefit even
      more.
      
      [1] https://github.com/bjoto/litmus-xsk
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210305094113.413544-2-bjorn.topel@gmail.com
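      The acquire/release variants of P0 and P1 in the commit above can be
      mirrored in portable C11 atomics. The sketch below is an illustration
      under assumptions, not the kernel or libbpf ring code: the struct and
      function names are hypothetical, and the producer loads cons with
      acquire semantics, because C11 (unlike the LKMM) does not treat the
      control dependency used in the litmus test as an ordering guarantee.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One-entry ring: prod == cons means empty, prod != cons means full. */
struct one_entry_ring {
	_Atomic int prod;
	_Atomic int cons;
	int data;
};

/* Producer side, modeled on the acquire/release P0. */
static bool ring_produce(struct one_entry_ring *r, int value)
{
	int p = atomic_load_explicit(&r->prod, memory_order_relaxed);

	/* Pairs with the consumer's release store to cons. */
	if (atomic_load_explicit(&r->cons, memory_order_acquire) != p)
		return false; /* ring is full */

	r->data = value;
	/* Publish data before the new producer index becomes visible. */
	atomic_store_explicit(&r->prod, p ^ 1, memory_order_release);
	return true;
}

/* Consumer side, modeled on the acquire/release P1. */
static bool ring_consume(struct one_entry_ring *r, int *out)
{
	int c = atomic_load_explicit(&r->cons, memory_order_relaxed);

	/* Pairs with the producer's release store to prod. */
	if (atomic_load_explicit(&r->prod, memory_order_acquire) == c)
		return false; /* ring is empty */

	*out = r->data;
	/* Finish reading data before handing the slot back. */
	atomic_store_explicit(&r->cons, c ^ 1, memory_order_release);
	return true;
}
```

      With one producer thread calling ring_produce() and one consumer
      thread calling ring_consume(), the release/acquire pairs are what
      rule out the outcomes that the "exists" clauses of the litmus tests
      check for.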
    • selftests/bpf: Fix test_attach_probe for powerpc uprobes · 299194a9
      Jiri Olsa authored
      When testing uprobes, the test gets the GEP (Global Entry Point)
      address from kallsyms, but then the function is called locally,
      so the uprobe is not triggered.
      
      Fix this by adjusting the address to the LEP (Local Entry Point)
      for the powerpc arch, using an instruction check taken from the
      ppc_function_entry function, as pointed out and explained by Michael
      and Naveen.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Link: https://lore.kernel.org/bpf/20210305134050.139840-1-jolsa@kernel.org
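      A rough sketch of the kind of instruction check described in the
      commit above: on PPC64 ELF ABIv2, a function's global entry point may
      begin with a two-instruction sequence that sets up r2, and the local
      entry point follows it. The function name gep_to_lep_skip is
      hypothetical, and the opcode constants are written from memory of the
      kernel's ppc_function_entry(), so they should be checked against the
      kernel source before being relied upon.

```c
#include <stddef.h>
#include <stdint.h>

/* Keep the opcode, RT and RA fields; drop the 16-bit immediate. */
#define OP_RT_RA_MASK 0xffff0000u
#define LIS_R2        0x3c400000u /* lis   r2,imm     */
#define ADDIS_R2_R12  0x3c4c0000u /* addis r2,r12,imm */
#define ADDI_R2_R2    0x38420000u /* addi  r2,r2,imm  */

/*
 * Given the first two instruction words at a function's global entry
 * point, return how many bytes to skip to reach the local entry point:
 * 8 if the r2 setup sequence is present, 0 otherwise.
 */
static size_t gep_to_lep_skip(const uint32_t insn[2])
{
	uint32_t first = insn[0] & OP_RT_RA_MASK;
	uint32_t second = insn[1] & OP_RT_RA_MASK;

	if ((first == ADDIS_R2_R12 || first == LIS_R2) && second == ADDI_R2_R2)
		return 2 * sizeof(uint32_t);
	return 0;
}
```

      A test would add the returned value to the address it got for the
      global entry point before computing the uprobe offset.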
  2. 05 Mar, 2021 36 commits
  3. 04 Mar, 2021 1 commit
    • selftests/bpf: Add a verifier scale test with unknown bounded loop · 86a35af6
      Yonghong Song authored
      The original bcc pull request https://github.com/iovisor/bcc/pull/3270 exposed
      a verifier failure with Clang 12/13 while Clang 4 works fine.
      
      Further investigation exposed two issues:
      
        Issue 1: LLVM may generate code which uses a less refined value. The issue
                 is fixed in LLVM patch: https://reviews.llvm.org/D97479
      
        Issue 2: Spills with initial value 0 are marked as precise which makes later
                 state pruning less effective. This is my rough initial analysis and
                 further investigation is needed to find how to improve verifier
                 pruning in such cases.
      
      With the above LLVM patch, for the new loop6.c test, which has a smaller loop
      bound compared to the original test, I got:
      
        $ test_progs -s -n 10/16
        ...
        stack depth 64
        processed 390735 insns (limit 1000000) max_states_per_insn 87
            total_states 8658 peak_states 964 mark_read 6
        #10/16 loop6.o:OK
      
      Using the original loop bound, i.e., commenting out "#define WORKAROUND", I got:
      
        $ test_progs -s -n 10/16
        ...
        BPF program is too large. Processed 1000001 insn
        stack depth 64
        processed 1000001 insns (limit 1000000) max_states_per_insn 91
            total_states 23176 peak_states 5069 mark_read 6
        ...
        #10/16 loop6.o:FAIL
      
      The purpose of this patch is to provide a regression test for the above LLVM fix
      and also provide a test case for further analyzing the verifier pruning issue.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Zhenwei Pi <pizhenwei@bytedance.com>
      Link: https://lore.kernel.org/bpf/20210226223810.236472-1-yhs@fb.com