1. 19 Nov, 2019 3 commits
  2. 18 Nov, 2019 9 commits
    • Yonghong Song's avatar
      selftests, bpf: Workaround an alu32 sub-register spilling issue · 2ea2612b
      Yonghong Song authored
      Currently, with latest llvm trunk, selftest test_progs failed obj
      file test_seg6_loop.o with the following error in verifier:
      
        infinite loop detected at insn 76
      
      The byte code sequence looks like below, and noted that alu32 has been
      turned off by default for better generated codes in general:
      
            48:       w3 = 100
            49:       *(u32 *)(r10 - 68) = r3
            ...
        ;             if (tlv.type == SR6_TLV_PADDING) {
            76:       if w3 == 5 goto -18 <LBB0_19>
            ...
            85:       r1 = *(u32 *)(r10 - 68)
        ;     for (int i = 0; i < 100; i++) {
            86:       w1 += -1
            87:       if w1 == 0 goto +5 <LBB0_20>
            88:       *(u32 *)(r10 - 68) = r1
      
      The main reason for verification failure is due to partial spills at
      r10 - 68 for induction variable "i".
      
      Current verifier only handles spills with 8-byte values. The above 4-byte
      value spill to stack is treated to STACK_MISC and its content is not
      saved. For the above example:
      
          w3 = 100
            R3_w=inv100 fp-64_w=inv1086626730498
          *(u32 *)(r10 - 68) = r3
            R3_w=inv100 fp-64_w=inv1086626730498
          ...
          r1 = *(u32 *)(r10 - 68)
            R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
            fp-64=inv1086626730498
      
      To resolve this issue, verifier needs to be extended to track sub-registers
      in spilling, or llvm needs to enhanced to prevent sub-register spilling
      in register allocation phase. The former will increase verifier complexity
      and the latter will need some llvm "hacking".
      
      Let us workaround this issue by declaring the induction variable as "long"
      type so spilling will happen at non sub-register level. We can revisit this
      later if sub-register spilling causes similar or other verification issues.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20191117214036.1309510-1-yhs@fb.com
      2ea2612b
    • Jiri Benc's avatar
      selftests, bpf: Fix test_tc_tunnel hanging · 3b054b71
      Jiri Benc authored
      When run_kselftests.sh is run, it hangs after test_tc_tunnel.sh. The reason
      is test_tc_tunnel.sh ensures the server ('nc -l') is run all the time,
      starting it again every time it is expected to terminate. The exception is
      the final client_connect: the server is not started anymore, which ensures
      no process is kept running after the test is finished.
      
      For a sit test, though, the script is terminated prematurely without the
      final client_connect and the 'nc' process keeps running. This in turn causes
      the run_one function in kselftest/runner.sh to hang forever, waiting for the
      runaway process to finish.
      
      Ensure a remaining server is terminated on cleanup.
      
      Fixes: f6ad6acc ("selftests/bpf: expand test_tc_tunnel with SIT encap")
      Signed-off-by: default avatarJiri Benc <jbenc@redhat.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/bpf/60919291657a9ee89c708d8aababc28ebe1420be.1573821780.git.jbenc@redhat.com
      3b054b71
    • Jiri Benc's avatar
      selftests, bpf: xdping is not meant to be run standalone · 56bf877a
      Jiri Benc authored
      The actual test to run is test_xdping.sh, which is already in TEST_PROGS.
      The xdping program alone is not runnable with 'make run_tests', it
      immediatelly fails due to missing arguments.
      
      Move xdping to TEST_GEN_PROGS_EXTENDED in order to be built but not run.
      
      Fixes: cd538502 ("selftests/bpf: measure RTT from xdp using xdping")
      Signed-off-by: default avatarJiri Benc <jbenc@redhat.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/4365c81198f62521344c2215909634407184387e.1573821726.git.jbenc@redhat.com
      56bf877a
    • Daniel Borkmann's avatar
      Merge branch 'bpf-array-mmap' · b97e12e5
      Daniel Borkmann authored
      Andrii Nakryiko says:
      
      ====================
      This patch set adds ability to memory-map BPF array maps (single- and
      multi-element). The primary use case is memory-mapping BPF array maps, created
      to back global data variables, created by libbpf implicitly. This allows for
      much better usability, along with avoiding syscalls to read or update data
      completely.
      
      Due to memory-mapping requirements, BPF array map that is supposed to be
      memory-mapped, has to be created with special BPF_F_MMAPABLE attribute, which
      triggers slightly different memory allocation strategy internally. See
      patch 1 for details.
      
      Libbpf is extended to detect kernel support for this flag, and if supported,
      will specify it for all global data maps automatically.
      
      Patch #1 refactors bpf_map_inc() and converts bpf_map's refcnt to atomic64_t
      to make refcounting never fail. Patch #2 does similar refactoring for
      bpf_prog_add()/bpf_prog_inc().
      
      v5->v6:
      - add back uref counting (Daniel);
      
      v4->v5:
      - change bpf_prog's refcnt to atomic64_t (Daniel);
      
      v3->v4:
      - add mmap's open() callback to fix refcounting (Johannes);
      - switch to remap_vmalloc_pages() instead of custom fault handler (Johannes);
      - converted bpf_map's refcnt/usercnt into atomic64_t;
      - provide default bpf_map_default_vmops handling open/close properly;
      
      v2->v3:
      - change allocation strategy to avoid extra pointer dereference (Jakub);
      
      v1->v2:
      - fix map lookup code generation for BPF_F_MMAPABLE case;
      - prevent BPF_F_MMAPABLE flag for all but plain array map type;
      - centralize ref-counting in generic bpf_map_mmap();
      - don't use uref counting (Alexei);
      - use vfree() directly;
      - print flags with %x (Song);
      - extend tests to verify bpf_map_{lookup,update}_elem() logic as well.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      b97e12e5
    • Andrii Nakryiko's avatar
      selftests/bpf: Add BPF_TYPE_MAP_ARRAY mmap() tests · 5051b384
      Andrii Nakryiko authored
      Add selftests validating mmap()-ing BPF array maps: both single-element and
      multi-element ones. Check that plain bpf_map_update_elem() and
      bpf_map_lookup_elem() work correctly with memory-mapped array. Also convert
      CO-RE relocation tests to use memory-mapped views of global data.
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-6-andriin@fb.com
      5051b384
    • Andrii Nakryiko's avatar
      libbpf: Make global data internal arrays mmap()-able, if possible · 7fe74b43
      Andrii Nakryiko authored
      Add detection of BPF_F_MMAPABLE flag support for arrays and add it as an extra
      flag to internal global data maps, if supported by kernel. This allows users
      to memory-map global data and use it without BPF map operations, greatly
      simplifying user experience.
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-5-andriin@fb.com
      7fe74b43
    • Andrii Nakryiko's avatar
      bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Andrii Nakryiko authored
      Add ability to memory-map contents of BPF array map. This is extremely useful
      for working with BPF global data from userspace programs. It allows to avoid
      typical bpf_map_{lookup,update}_elem operations, improving both performance
      and usability.
      
      There had to be special considerations for map freezing, to avoid having
      writable memory view into a frozen map. To solve this issue, map freezing and
      mmap-ing is happening under mutex now:
        - if map is already frozen, no writable mapping is allowed;
        - if map has writable memory mappings active (accounted in map->writecnt),
          map freezing will keep failing with -EBUSY;
        - once number of writable memory mappings drops to zero, map freezing can be
          performed again.
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
      to be mmap()'able. We also need to make sure that array data memory is
      page-sized and page-aligned, so we over-allocate memory in such a way that
      struct bpf_array is at the end of a single page of memory with array->value
      being aligned with the start of the second page. On deallocation we need to
      accomodate this memory arrangement to free vmalloc()'ed memory correctly.
      
      One important consideration regarding how memory-mapping subsystem functions.
      Memory-mapping subsystem provides few optional callbacks, among them open()
      and close().  close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources, if
      necessary. open() is *almost* symmetrical: it's called for each memory region
      that is being mapped, **except** the very first one. So bpf_map_mmap does
      initial refcnt bump, while open() will do any extra ones after that. Thus
      number of close() calls is equal to number of open() calls plus one more.
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
      fc970227
    • Andrii Nakryiko's avatar
      bpf: Convert bpf_prog refcnt to atomic64_t · 85192dbf
      Andrii Nakryiko authored
      Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
      and remove artificial 32k limit. This allows to make bpf_prog's refcounting
      non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc.
      
      Validated compilation by running allyesconfig kernel build.
      Suggested-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com
      85192dbf
    • Andrii Nakryiko's avatar
      bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails · 1e0bd5a0
      Andrii Nakryiko authored
      92117d84 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
      potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
      (32k). Due to using 32-bit counter, it's possible in practice to overflow
      refcounter and make it wrap around to 0, causing erroneous map free, while
      there are still references to it, causing use-after-free problems.
      
      But having a failing refcounting operations are problematic in some cases. One
      example is mmap() interface. After establishing initial memory-mapping, user
      is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
      splitting it into multiple non-contiguous regions. All this happening without
      any control from the users of mmap subsystem. Rather mmap subsystem sends
      notifications to original creator of memory mapping through open/close
      callbacks, which are optionally specified during initial memory mapping
      creation. These callbacks are used to maintain accurate refcount for bpf_map
      (see next patch in this series). The problem is that open() callback is not
      supposed to fail, because memory-mapped resource is set up and properly
      referenced. This is posing a problem for using memory-mapping with BPF maps.
      
      One solution to this is to maintain separate refcount for just memory-mappings
      and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
      There are similar use cases in current work on tcp-bpf, necessitating extra
      counter as well. This seems like a rather unfortunate and ugly solution that
      doesn't scale well to various new use cases.
      
      Another approach to solve this is to use non-failing refcount_t type, which
      uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
      stays there. This utlimately causes memory leak, but prevents use after free.
      
      But given refcounting is not the most performance-critical operation with BPF
      maps (it's not used from running BPF program code), we can also just switch to
      64-bit counter that can't overflow in practice, potentially disadvantaging
      32-bit platforms a tiny bit. This simplifies semantics and allows above
      described scenarios to not worry about failing refcount increment operation.
      
      In terms of struct bpf_map size, we are still good and use the same amount of
      space:
      
      BEFORE (3 cache lines, 8 bytes of padding at the end):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
      	atomic_t                   usercnt;              /*   132     4 */
      	struct work_struct work;                         /*   136    32 */
      	char                       name[16];             /*   168    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 146, holes: 1, sum holes: 38 */
      	/* padding: 8 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      AFTER (same 3 cache lines, no extra padding now):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
      	atomic64_t                 usercnt;              /*   136     8 */
      	struct work_struct work;                         /*   144    32 */
      	char                       name[16];             /*   176    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 154, holes: 1, sum holes: 38 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      This patch, while modifying all users of bpf_map_inc, also cleans up its
      interface to match bpf_map_put with separate operations for bpf_map_inc and
      bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
      respectively). Also, given there are no users of bpf_map_inc_not_zero
      specifying uref=true, remove uref flag and default to uref=false internally.
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
      1e0bd5a0
  3. 15 Nov, 2019 26 commits
  4. 11 Nov, 2019 2 commits
    • Anders Roxell's avatar
      bpf, testing: Add missing object file to TEST_FILES · e47a1799
      Anders Roxell authored
      When installing kselftests to its own directory and run the
      test_lwt_ip_encap.sh it will complain that test_lwt_ip_encap.o can't be
      found. Same with the test_tc_edt.sh test it will complain that
      test_tc_edt.o can't be found.
      
        $ ./test_lwt_ip_encap.sh
        starting egress IPv4 encap test
        Error opening object test_lwt_ip_encap.o: No such file or directory
        Object hashing failed!
        Cannot initialize ELF context!
        Failed to parse eBPF program: Invalid argument
      
      Rework to add test_lwt_ip_encap.o and test_tc_edt.o to TEST_FILES so the
      object file gets installed when installing kselftest.
      
      Fixes: 74b5a596 ("selftests/bpf: Replace test_progs and test_maps w/ general rule")
      Signed-off-by: default avatarAnders Roxell <anders.roxell@linaro.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191111161728.8854-1-anders.roxell@linaro.org
      e47a1799
    • Yonghong Song's avatar
      bpf, testing: Workaround a verifier failure for test_progs · b7a0d65d
      Yonghong Song authored
      With latest llvm compiler, running test_progs will have the following
      verifier failure for test_sysctl_loop1.o:
      
        libbpf: load bpf program failed: Permission denied
        libbpf: -- BEGIN DUMP LOG ---
        libbpf:
        invalid indirect read from stack var_off (0x0; 0xff)+196 size 7
        ...
        libbpf: -- END LOG --
        libbpf: failed to load program 'cgroup/sysctl'
        libbpf: failed to load object 'test_sysctl_loop1.o'
      
      The related bytecode looks as below:
      
        0000000000000308 LBB0_8:
            97:       r4 = r10
            98:       r4 += -288
            99:       r4 += r7
           100:       w8 &= 255
           101:       r1 = r10
           102:       r1 += -488
           103:       r1 += r8
           104:       r2 = 7
           105:       r3 = 0
           106:       call 106
           107:       w1 = w0
           108:       w1 += -1
           109:       if w1 > 6 goto -24 <LBB0_5>
           110:       w0 += w8
           111:       r7 += 8
           112:       w8 = w0
           113:       if r7 != 224 goto -17 <LBB0_8>
      
      And source code:
      
           for (i = 0; i < ARRAY_SIZE(tcp_mem); ++i) {
                   ret = bpf_strtoul(value + off, MAX_ULONG_STR_LEN, 0,
                                     tcp_mem + i);
                   if (ret <= 0 || ret > MAX_ULONG_STR_LEN)
                           return 0;
                   off += ret & MAX_ULONG_STR_LEN;
           }
      
      Current verifier is not able to conclude that register w0 before '+'
      at insn 110 has a range of 1 to 7 and thinks it is from 0 - 255. This
      leads to more conservative range for w8 at insn 112, and later verifier
      complaint.
      
      Let us workaround this issue until we found a compiler and/or verifier
      solution. The workaround in this patch is to make variable 'ret' volatile,
      which will force a reload and then '&' operation to ensure better value
      range. With this patch, I got the below byte code for the loop:
      
        0000000000000328 LBB0_9:
           101:       r4 = r10
           102:       r4 += -288
           103:       r4 += r7
           104:       w8 &= 255
           105:       r1 = r10
           106:       r1 += -488
           107:       r1 += r8
           108:       r2 = 7
           109:       r3 = 0
           110:       call 106
           111:       *(u32 *)(r10 - 64) = r0
           112:       r1 = *(u32 *)(r10 - 64)
           113:       if w1 s< 1 goto -28 <LBB0_5>
           114:       r1 = *(u32 *)(r10 - 64)
           115:       if w1 s> 7 goto -30 <LBB0_5>
           116:       r1 = *(u32 *)(r10 - 64)
           117:       w1 &= 7
           118:       w1 += w8
           119:       r7 += 8
           120:       w8 = w1
           121:       if r7 != 224 goto -21 <LBB0_9>
      
      Insn 117 did the '&' operation and we got more precise value range
      for 'w8' at insn 120. The test is happy then:
      
        #3/17 test_sysctl_loop1.o:OK
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191107170045.2503480-1-yhs@fb.com
      b7a0d65d