1. 10 Apr, 2019 6 commits
    • Daniel Borkmann's avatar
      bpf: add specification for BTF Var and DataSec kinds · f063c889
      Daniel Borkmann authored
      This adds the BTF specification and UAPI bits for supporting BTF Var
      and DataSec kinds. This is following LLVM upstream commit ac4082b77e07
      ("[BPF] Add BTF Var and DataSec Support") which has been merged recently.
      Var itself is for describing a global variable and DataSec to describe
      ELF sections e.g. data/bss/rodata sections that hold one or multiple
      global variables.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f063c889
    • Daniel Borkmann's avatar
      bpf: allow . char as part of the object name · 3e0ddc4f
      Daniel Borkmann authored
      Trivial addition to allow '.' aside from '_' as "special" characters
      in the object name. Used to allow for substrings in maps from loader
      side such as ".bss", ".data", ".rodata", but could also be useful for
      other purposes.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3e0ddc4f
    • Daniel Borkmann's avatar
      bpf: add syscall side map freeze support · 87df15de
      Daniel Borkmann authored
      This patch adds a new BPF_MAP_FREEZE command which allows to
      "freeze" the map globally as read-only / immutable from syscall
      side.
      
      Map permission handling has been refactored into map_get_sys_perms()
      and drops FMODE_CAN_WRITE in case of locked map. Main use case is
      to allow for setting up .rodata sections from the BPF ELF which
      are loaded into the kernel, meaning BPF loader first allocates
      map, sets up map value by copying .rodata section into it and once
      complete, it calls BPF_MAP_FREEZE on the map fd to prevent further
      modifications.
      
      Right now BPF_MAP_FREEZE only takes map fd as argument while remaining
      bpf_attr members are required to be zero. I didn't add write-only
      locking here as counterpart since I don't have a concrete use-case
      for it on my side, and I think it makes probably more sense to wait
      once there is actually one. In that case bpf_attr can be extended
      as usual with a flag field and/or others where flag 0 means that
      we lock the map read-only hence this doesn't prevent to add further
      extensions to BPF_MAP_FREEZE upon need.
      
      A map creation flag like BPF_F_WRONCE was not considered for couple
      of reasons: i) in case of a generic implementation, a map can consist
      of more than just one element, thus there could be multiple map
      updates needed to set the map into a state where it can then be
      made immutable, ii) WRONCE indicates exact one-time write before
      it is then set immutable. A generic implementation would set a bit
      atomically on map update entry (if unset), indicating that every
      subsequent update from then onwards will need to bail out there.
      However, map updates can fail, so upon failure that flag would need
      to be unset again and the update attempt would need to be repeated
      for it to be eventually made immutable. While this can be made
      race-free, this approach feels less clean and in combination with
      reason i), it's not generic enough. A dedicated BPF_MAP_FREEZE
      command directly sets the flag and caller has the guarantee that
      map is immutable from syscall side upon successful return for any
      future syscall invocations that would alter the map state, which
      is also more intuitive from an API point of view. A command name
      such as BPF_MAP_LOCK has been avoided as it's too close with BPF
      map spin locks (which already has BPF_F_LOCK flag). BPF_MAP_FREEZE
      is so far only enabled for privileged users.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      87df15de
    • Daniel Borkmann's avatar
      bpf: add program side {rd, wr}only support for maps · 591fe988
      Daniel Borkmann authored
      This work adds two new map creation flags BPF_F_RDONLY_PROG
      and BPF_F_WRONLY_PROG in order to allow for read-only or
      write-only BPF maps from a BPF program side.
      
      Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
      applies to system call side, meaning the BPF program has full
      read/write access to the map as usual while bpf(2) calls with
      map fd can either only read or write into the map depending
      on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allows
      for the exact opposite such that verifier is going to reject
      program loads if write into a read-only map or a read into a
      write-only map is detected. For read-only map case also some
      helpers are forbidden for programs that would alter the map
      state such as map deletion, update, etc. As opposed to the two
      BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
      as BPF_F_WRONLY_PROG really do correspond to the map lifetime.
      
      We've enabled this generic map extension to various non-special
      maps holding normal user data: array, hash, lru, lpm, local
      storage, queue and stack. Further generic map types could be
      followed up in future depending on use-case. Main use case
      here is to forbid writes into .rodata map values from verifier
      side.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      591fe988
    • Daniel Borkmann's avatar
      bpf: do not retain flags that are not tied to map lifetime · be70bcd5
      Daniel Borkmann authored
      Both BPF_F_WRONLY / BPF_F_RDONLY flags are tied to the map file
      descriptor, but not to the map object itself! Meaning, at map
      creation time BPF_F_RDONLY can be set to make the map read-only
      from syscall side, but this holds only for the returned fd, so
      any other fd either retrieved via bpf file system or via map id
      for the very same underlying map object can have read-write access
      instead.
      
      Given that, keeping the two flags around in the map_flags attribute
      and exposing them to user space upon map dump is misleading and
      may lead to false conclusions. Since these two flags are not
      tied to the map object lets also not store them as map property.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      be70bcd5
    • Daniel Borkmann's avatar
      bpf: implement lookup-free direct value access for maps · d8eca5bb
      Daniel Borkmann authored
      This generic extension to BPF maps allows for directly loading
      an address residing inside a BPF map value as a single BPF
      ldimm64 instruction!
      
      The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
      is a special src_reg flag for ldimm64 instruction that indicates
      that inside the first part of the double insns's imm field is a
      file descriptor which the verifier then replaces as a full 64bit
      address of the map into both imm parts. For the newly added
      BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
      the first part of the double insns's imm field is again a file
      descriptor corresponding to the map, and the second part of the
      imm field is an offset into the value. The verifier will then
      replace both imm parts with an address that points into the BPF
      map value at the given value offset for maps that support this
      operation. Currently supported is array map with single entry.
      It is possible to support more than just single map element by
      reusing both 16bit off fields of the insns as a map index, so
      full array map lookup could be expressed that way. It hasn't
      been implemented here due to lack of concrete use case, but
      could easily be done so in future in a compatible way, since
      both off fields right now have to be 0 and would correctly
      denote a map index 0.
      
      The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
      BPF_PSEUDO_MAP_FD we could not differ offset 0 between load of
      map pointer versus load of map's value at offset 0, and changing
      BPF_PSEUDO_MAP_FD's encoding into off by one to differ between
      regular map pointer and map value pointer would add unnecessary
      complexity and increases barrier for debugability thus less
      suitable. Using the second part of the imm field as an offset
      into the value does /not/ come with limitations since maximum
      possible value size is in u32 universe anyway.
      
      This optimization allows for efficiently retrieving an address
      to a map value memory area without having to issue a helper call
      which needs to prepare registers according to calling convention,
      etc, without needing the extra NULL test, and without having to
      add the offset in an additional instruction to the value base
      pointer. The verifier then treats the destination register as
      PTR_TO_MAP_VALUE with constant reg->off from the user passed
      offset from the second imm field, and guarantees that this is
      within bounds of the map value. Any subsequent operations are
      normally treated as typical map value handling without anything
      extra needed from verification side.
      
      The two map operations for direct value access have been added to
      array map for now. In future other types could be supported as
      well depending on the use case. The main use case for this commit
      is to allow for BPF loader support for global variables that
      reside in .data/.rodata/.bss sections such that we can directly
      load the address of them with minimal additional infrastructure
      required. Loader support has been added in subsequent commits for
      libbpf library.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d8eca5bb
  2. 07 Apr, 2019 1 commit
  3. 05 Apr, 2019 9 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-varstack-fixes' · 347807d3
      Daniel Borkmann authored
      Andrey Ignatov says:
      
      ====================
      v2->v3:
      - sanity check max value for variable offset.
      
      v1->v2:
      - rely on meta = NULL to reject var_off stack access to uninit buffer.
      
      This patch set is a follow-up for discussion [1].
      
      It fixes variable offset stack access handling for raw and unprivileged
      mode, rejecting both of them, and sanity checks max variable offset value.
      
      Patch 1 handles raw (uninitialized) mode.
      Patch 2 adds test for raw mode.
      Patch 3 handles unprivileged mode.
      Patch 4 adds test for unprivileged mode.
      Patch 5 adds sanity check for max value of variable offset.
      Patch 6 adds test for variable offset max value checking.
      Patch 7 is a minor fix in verbose log.
      
      Unprivileged mode is an interesting case since one (and only?) way to come
      up with variable offset is to use pointer arithmetics. Though pointer
      arithmetics is already prohibited for unprivileged mode. I'm not sure if
      it's enough though and it seems like a good idea to still reject variable
      offset for unpriv in check_stack_boundary(). Please see patches 3 and 4
      for more details on this.
      
      [1] https://marc.info/?l=linux-netdev&m=155419526427742&w=2
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      347807d3
    • Andrey Ignatov's avatar
      bpf: Add missed newline in verifier verbose log · 1fbd20f8
      Andrey Ignatov authored
      check_stack_access() that prints verbose log is used in
      adjust_ptr_min_max_vals() that prints its own verbose log and now they
      stick together, e.g.:
      
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16
        size=1R2 stack pointer arithmetic goes out of range, prohibited for
        !root
      
      Add missing newline so that log is more readable:
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1
        R2 stack pointer arithmetic goes out of range, prohibited for !root
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1fbd20f8
    • Andrey Ignatov's avatar
      selftests/bpf: Test unbounded var_off stack access · 07f91962
      Andrey Ignatov authored
      Test the case when reg->smax_value is too small/big and can overflow,
      and separately min and max values outside of stack bounds.
      
      Example of output:
        # ./test_verifier
        #856/p indirect variable-offset stack access, unbounded OK
        #857/p indirect variable-offset stack access, max out of bound OK
        #858/p indirect variable-offset stack access, min out of bound OK
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      07f91962
    • Andrey Ignatov's avatar
      bpf: Sanity check max value for var_off stack access · 107c26a7
      Andrey Ignatov authored
      As discussed in [1] max value of variable offset has to be checked for
      overflow on stack access otherwise verifier would accept code like this:
      
        0: (b7) r2 = 6
        1: (b7) r3 = 28
        2: (7a) *(u64 *)(r10 -16) = 0
        3: (7a) *(u64 *)(r10 -8) = 0
        4: (79) r4 = *(u64 *)(r1 +168)
        5: (c5) if r4 s< 0x0 goto pc+4
         R1=ctx(id=0,off=0,imm=0) R2=inv6 R3=inv28
         R4=inv(id=0,umax_value=9223372036854775807,var_off=(0x0;
         0x7fffffffffffffff)) R10=fp0,call_-1 fp-8=mmmmmmmm fp-16=mmmmmmmm
        6: (17) r4 -= 16
        7: (0f) r4 += r10
        8: (b7) r5 = 8
        9: (85) call bpf_getsockopt#57
        10: (b7) r0 = 0
        11: (95) exit
      
      , where R4 obviosly has unbounded max value.
      
      Fix it by checking that reg->smax_value is inside (-BPF_MAX_VAR_OFF;
      BPF_MAX_VAR_OFF) range.
      
      reg->smax_value is used instead of reg->umax_value because stack
      pointers are calculated using negative offset from fp. This is opposite
      to e.g. map access where offset must be non-negative and where
      umax_value is used.
      
      Also dedicated verbose logs are added for both min and max bound check
      failures to have diagnostics consistent with variable offset handling in
      check_map_access().
      
      [1] https://marc.info/?l=linux-netdev&m=155433357510597&w=2
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      107c26a7
    • Andrey Ignatov's avatar
      selftests/bpf: Test indirect var_off stack access in unpriv mode · 2c6927db
      Andrey Ignatov authored
      Test that verifier rejects indirect stack access with variable offset in
      unprivileged mode and accepts same code in privileged mode.
      
      Since pointer arithmetics is prohibited in unprivileged mode verifier
      should reject the program even before it gets to helper call that uses
      variable offset, at the time when that variable offset is trying to be
      constructed.
      
      Example of output:
        # ./test_verifier
        ...
        #859/u indirect variable-offset stack access, priv vs unpriv OK
        #859/p indirect variable-offset stack access, priv vs unpriv OK
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      2c6927db
    • Andrey Ignatov's avatar
      bpf: Reject indirect var_off stack access in unpriv mode · 088ec26d
      Andrey Ignatov authored
      Proper support of indirect stack access with variable offset in
      unprivileged mode (!root) requires corresponding support in Spectre
      masking for stack ALU in retrieve_ptr_limit().
      
      There are no use-case for variable offset in unprivileged mode though so
      make verifier reject such accesses for simplicity.
      
      Pointer arithmetics is one (and only?) way to cause variable offset and
      it's already rejected in unpriv mode so that verifier won't even get to
      helper function whose argument contains variable offset, e.g.:
      
        0: (7a) *(u64 *)(r10 -16) = 0
        1: (7a) *(u64 *)(r10 -8) = 0
        2: (61) r2 = *(u32 *)(r1 +0)
        3: (57) r2 &= 4
        4: (17) r2 -= 16
        5: (0f) r2 += r10
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2
        stack pointer arithmetic goes out of range, prohibited for !root
      
      Still it looks like a good idea to reject variable offset indirect stack
      access for unprivileged mode in check_stack_boundary() explicitly.
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      088ec26d
    • Andrey Ignatov's avatar
      selftests/bpf: Test indirect var_off stack access in raw mode · f68a5b44
      Andrey Ignatov authored
      Test that verifier rejects indirect access to uninitialized stack with
      variable offset.
      
      Example of output:
        # ./test_verifier
        ...
        #859/p indirect variable-offset stack access, uninitialized OK
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f68a5b44
    • Andrey Ignatov's avatar
      bpf: Reject indirect var_off stack access in raw mode · f2bcd05e
      Andrey Ignatov authored
      It's hard to guarantee that whole memory is marked as initialized on
      helper return if uninitialized stack is accessed with variable offset
      since specific bounds are unknown to verifier. This may cause
      uninitialized stack leaking.
      
      Reject such an access in check_stack_boundary to prevent possible
      leaking.
      
      There are no known use-cases for indirect uninitialized stack access
      with variable offset so it shouldn't break anything.
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f2bcd05e
    • Alexei Starovoitov's avatar
      samples/bpf: fix build with new clang · 636e78b1
      Alexei Starovoitov authored
      clang started to error on invalid asm clobber usage in x86 headers
      and many bpf program samples failed to build with the message:
      
        CLANG-bpf  /data/users/ast/bpf-next/samples/bpf/xdp_redirect_kern.o
      In file included from /data/users/ast/bpf-next/samples/bpf/xdp_redirect_kern.c:14:
      In file included from ../include/linux/in.h:23:
      In file included from ../include/uapi/linux/in.h:24:
      In file included from ../include/linux/socket.h:8:
      In file included from ../include/linux/uio.h:14:
      In file included from ../include/crypto/hash.h:16:
      In file included from ../include/linux/crypto.h:26:
      In file included from ../include/linux/uaccess.h:5:
      In file included from ../include/linux/sched.h:15:
      In file included from ../include/linux/sem.h:5:
      In file included from ../include/uapi/linux/sem.h:5:
      In file included from ../include/linux/ipc.h:9:
      In file included from ../include/linux/refcount.h:72:
      ../arch/x86/include/asm/refcount.h:72:36: error: asm-specifier for input or output variable conflicts with asm clobber list
                                               r->refs.counter, e, "er", i, "cx");
                                                                            ^
      ../arch/x86/include/asm/refcount.h:86:27: error: asm-specifier for input or output variable conflicts with asm clobber list
                                               r->refs.counter, e, "cx");
                                                                   ^
      2 errors generated.
      
      Override volatile() to workaround the problem.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      636e78b1
  4. 04 Apr, 2019 2 commits
  5. 03 Apr, 2019 11 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-verifier-scalability' · cc441a69
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      v1->v2:
      - fixed typo in patch 1
      - added a patch to convert kcalloc to kvcalloc
      - added a patch to verbose 16-bit jump offset check
      - added a test with 1m insns
      
      This patch set is the first step to be able to accept large programs.
      The verifier still suffers from its brute force algorithm and
      large programs can easily hit 1M insn_processed limit.
      A lot more work is necessary to be able to verify large programs.
      
      v1:
      Realize two key ideas to speed up verification speed by ~20 times
      1. every 'branching' instructions records all verifier states.
         not all of them are useful for search pruning.
         add a simple heuristic to keep states that were successful in search pruning
         and remove those that were not
      2. mark_reg_read walks parentage chain of registers to mark parents as LIVE_READ.
         Once the register is marked there is no need to remark it again in the future.
         Hence stop walking the chain once first LIVE_READ is seen.
      
      1st optimization gives 10x speed up on large programs
      and 2nd optimization reduces the cost of mark_reg_read from ~40% of cpu to <1%.
      Combined the deliver ~20x speedup on large programs.
      
      Faster and bounded verification time allows to increase insn_processed
      limit to 1 million from 130k.
      Worst case it takes 1/10 of a second to process that many instructions
      and peak memory consumption is peak_states * sizeof(struct bpf_verifier_state)
      which is around ~5Mbyte.
      
      Increase insn_per_program limit for root to insn_processed limit.
      
      Add verification stats and stress tests for verifier scalability.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      cc441a69
    • Alexei Starovoitov's avatar
      selftests/bpf: synthetic tests to push verifier limits · 8aa2d4b4
      Alexei Starovoitov authored
      Add a test to generate 1m ld_imm64 insns to stress the verifier.
      
      Bump the size of fill_ld_abs_vlan_push_pop test from 4k to 29k
      and jump_around_ld_abs from 4k to 5.5k.
      Larger sizes are not possible due to 16-bit offset encoding
      in jump instructions.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      8aa2d4b4
    • Alexei Starovoitov's avatar
      selftests/bpf: add few verifier scale tests · e5e7a8f2
      Alexei Starovoitov authored
      Add 3 basic tests that stress verifier scalability.
      
      test_verif_scale1.c calls non-inlined jhash() function 90 times on
      different position in the packet.
      This test simulates network packet parsing.
      jhash function is ~140 instructions and main program is ~1200 insns.
      
      test_verif_scale2.c force inlines jhash() function 90 times.
      This program is ~15k instructions long.
      
      test_verif_scale3.c calls non-inlined jhash() function 90 times on
      But this time jhash has to process 32-bytes from the packet
      instead of 14-bytes in tests 1 and 2.
      jhash function is ~230 insns and main program is ~1200 insns.
      
      $ test_progs -s
      can be used to see verifier stats.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      e5e7a8f2
    • Alexei Starovoitov's avatar
      libbpf: teach libbpf about log_level bit 2 · da11b417
      Alexei Starovoitov authored
      Allow bpf_prog_load_xattr() to specify log_level for program loading.
      
      Teach libbpf to accept log_level with bit 2 set.
      
      Increase default BPF_LOG_BUF_SIZE from 256k to 16M.
      There is no downside to increase it to a maximum allowed by old kernels.
      Existing 256k limit caused ENOSPC errors and users were not able to see
      verifier error which is printed at the end of the verifier log.
      
      If ENOSPC is hit, double the verifier log and try again to capture
      the verifier error.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      da11b417
    • Alexei Starovoitov's avatar
      bpf: increase verifier log limit · 7a9f5c65
      Alexei Starovoitov authored
      The existing 16Mbyte verifier log limit is not enough for log_level=2
      even for small programs. Increase it to 1Gbyte.
      Note it's not a kernel memory limit.
      It's an amount of memory user space provides to store
      the verifier log. The kernel populates it 1k at a time.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      7a9f5c65
    • Alexei Starovoitov's avatar
      bpf: increase complexity limit and maximum program size · c04c0d2b
      Alexei Starovoitov authored
      Large verifier speed improvements allow to increase
      verifier complexity limit.
      Now regardless of the program composition and its size it takes
      little time for the verifier to hit insn_processed limit.
      On typical x86 machine non-debug kernel processes 1M instructions
      in 1/10 of a second.
      (before these speed improvements specially crafted programs
      could be hitting multi-second verification times)
      Full kasan kernel with debug takes ~1 second for the same 1M insns.
      Hence bump the BPF_COMPLEXITY_LIMIT_INSNS limit to 1M.
      Also increase the number of instructions per program
      from 4k to internal BPF_COMPLEXITY_LIMIT_INSNS limit.
      4k limit was confusing to users, since small programs with hundreds
      of insns could be hitting BPF_COMPLEXITY_LIMIT_INSNS limit.
      Sometimes adding more insns and bpf_trace_printk debug statements
      would make the verifier accept the program while removing
      code would make the verifier reject it.
      Some user space application started to add #define MAX_FOO to
      their programs and do:
        MAX_FOO=100;
      again:
        compile with MAX_FOO;
        try to load;
        if (fails_to_load) { reduce MAX_FOO; goto again; }
      to be able to fit maximum amount of processing into single program.
      Other users artificially split their single program into a set of programs
      and use all 32 iterations of tail_calls to increase compute limits.
      And the most advanced folks used unlimited tc-bpf filter list
      to execute many bpf programs.
      Essentially the users managed to workaround 4k insn limit.
      This patch removes the limit for root programs from uapi.
      BPF_COMPLEXITY_LIMIT_INSNS is the kernel internal limit
      and success to load the program no longer depends on program size,
      but on 'smartness' of the verifier only.
      The verifier will continue to get smarter with every kernel release.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c04c0d2b
    • Alexei Starovoitov's avatar
      bpf: verbose jump offset overflow check · 4f73379e
      Alexei Starovoitov authored
      Larger programs may trigger 16-bit jump offset overflow check
      during instruction patching. Make this error verbose otherwise
      users cannot decipher error code without printks in the verifier.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      4f73379e
    • Alexei Starovoitov's avatar
      bpf: convert temp arrays to kvcalloc · 71dde681
      Alexei Starovoitov authored
      Temporary arrays used during program verification need to be vmalloc-ed
      to support large bpf programs.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      71dde681
    • Alexei Starovoitov's avatar
      bpf: improve verification speed by not remarking live_read · 25af32da
      Alexei Starovoitov authored
      With large verifier speed improvement brought by the previous patch
      mark_reg_read() becomes the hottest function during verification.
      On a typical program it consumes 40% of cpu.
      mark_reg_read() walks parentage chain of registers to mark parents as LIVE_READ.
      Once the register is marked there is no need to remark it again in the future.
      Hence stop walking the chain once first LIVE_READ is seen.
      This optimization drops mark_reg_read() time from 40% of cpu to <1%
      and overall 2x improvement of verification speed.
      For some programs the longest_mark_read_walk counter improves from ~500 to ~5
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: default avatarEdward Cree <ecree@solarflare.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      25af32da
    • Alexei Starovoitov's avatar
      bpf: improve verification speed by droping states · 9f4686c4
      Alexei Starovoitov authored
      Branch instructions, branch targets and calls in a bpf program are
      the places where the verifier remembers states that led to successful
      verification of the program.
      These states are used to prune brute force program analysis.
      For unprivileged programs there is a limit of 64 states per such
      'branching' instructions (maximum length is tracked by max_states_per_insn
      counter introduced in the previous patch).
      Simply reducing this threshold to 32 or lower increases insn_processed
      metric to the point that small valid programs get rejected.
      For root programs there is no limit and cilium programs can have
      max_states_per_insn to be 100 or higher.
      Walking 100+ states multiplied by number of 'branching' insns during
      verification consumes significant amount of cpu time.
      Turned out simple LRU-like mechanism can be used to remove states
      that unlikely will be helpful in future search pruning.
      This patch introduces hit_cnt and miss_cnt counters:
      hit_cnt - this many times this state successfully pruned the search
      miss_cnt - this many times this state was not equivalent to other states
      (and that other states were added to state list)
      
      The heuristic introduced in this patch is:
      if (sl->miss_cnt > sl->hit_cnt * 3 + 3)
        /* drop this state from future considerations */
      
      Higher numbers increase max_states_per_insn (allow more states to be
      considered for pruning) and slow verification speed, but do not meaningfully
      reduce insn_processed metric.
      Lower numbers drop too many states and insn_processed increases too much.
      Many different formulas were considered.
      This one is simple and works well enough in practice.
      (the analysis was done on selftests/progs/* and on cilium programs)
      
      The end result is this heuristic improves verification speed by 10 times.
      Large synthetic programs that used to take a second more now take
      1/10 of a second.
      In cases where max_states_per_insn used to be 100 or more, now it's ~10.
      
      There is a slight increase in insn_processed for cilium progs:
                             before   after
      bpf_lb-DLB_L3.o 	1831	1838
      bpf_lb-DLB_L4.o 	3029	3218
      bpf_lb-DUNKNOWN.o 	1064	1064
      bpf_lxc-DDROP_ALL.o	26309	26935
      bpf_lxc-DUNKNOWN.o	33517	34439
      bpf_netdev.o		9713	9721
      bpf_overlay.o		6184	6184
      bpf_lcx_jit.o		37335	39389
      And 2-3 times improvement in the verification speed.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      9f4686c4
    • Alexei Starovoitov's avatar
      bpf: add verifier stats and log_level bit 2 · 06ee7115
      Alexei Starovoitov authored
      In order to understand the verifier bottlenecks add various stats
      and extend log_level:
      log_level 1 and 2 are kept as-is:
      bit 0 - level=1 - print every insn and verifier state at branch points
      bit 1 - level=2 - print every insn and verifier state at every insn
      bit 2 - level=4 - print verifier error and stats at the end of verification
      
      When verifier rejects the program the libbpf is trying to load the program twice.
      Once with log_level=0 (no messages, only error code is reported to user space)
      and second time with log_level=1 to tell the user why the verifier rejected it.
      
      With introduction of bit 2 - level=4 the libbpf can choose to always use that
      level and load programs once, since the verification speed is not affected and
      in case of error the verbose message will be available.
      
      Note that the verifier stats are not part of uapi just like all other
      verbose messages. They're expected to change in the future.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      06ee7115
  6. 02 Apr, 2019 6 commits
  7. 01 Apr, 2019 1 commit
    • Yonghong Song's avatar
      bpf: add bpffs multi-dimensional array tests in test_btf · 9de2640b
      Yonghong Song authored
      For multiple dimensional arrays like below,
        int a[2][3]
      both llvm and pahole generated one BTF_KIND_ARRAY type like
        . element_type: int
        . index_type: unsigned int
        . number of elements: 6
      
      Such a collapsed BTF_KIND_ARRAY type will cause the divergence
      in BTF vs. the user code. In the compile-once-run-everywhere
      project, the header file is generated from BTF and used for bpf
      program, and the definition in the header file will be different
      from what user expects.
      
      But the kernel actually supports chained multi-dimensional array
      types properly. The above "int a[2][3]" can be represented as
        Type #n:
          . element_type: int
          . index_type: unsigned int
          . number of elements: 3
        Type #(n+1):
          . element_type: type #n
          . index_type: unsigned int
          . number of elements: 2
      
      The following llvm commit
        https://reviews.llvm.org/rL357215
      also enables llvm to generated proper chained multi-dimensional arrays.
      
      The test_btf already has a raw test ("struct test #1") for chained
      multi-dimensional arrays. This patch added amended bpffs test for
      chained multi-dimensional arrays.
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      9de2640b
  8. 29 Mar, 2019 3 commits
    • Alexei Starovoitov's avatar
      Merge branch 'variable-stack-access' · c3969de8
      Alexei Starovoitov authored
      Andrey Ignatov says:
      
      ====================
      The patch set adds support for stack access with variable offset from helpers.
      
      Patch 1 is the main patch in the set and provides more details.
      Patch 2 adds selftests for new functionality.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c3969de8
    • Andrey Ignatov's avatar
      selftests/bpf: Test variable offset stack access · 8ff80e96
      Andrey Ignatov authored
      Test different scenarios of indirect variable-offset stack access: out of
      bound access (>0), min_off below initialized part of the stack,
      max_off+size above initialized part of the stack, initialized stack.
      
      Example of output:
        ...
        #856/p indirect variable-offset stack access, out of bound OK
        #857/p indirect variable-offset stack access, max_off+size > max_initialized OK
        #858/p indirect variable-offset stack access, min_off < min_initialized OK
        #859/p indirect variable-offset stack access, ok OK
        ...
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      8ff80e96
    • Andrey Ignatov's avatar
      bpf: Support variable offset stack access from helpers · 2011fccf
      Andrey Ignatov authored
      Currently there is a difference in how verifier checks memory access for
      helper arguments for PTR_TO_MAP_VALUE and PTR_TO_STACK with regard to
      variable part of offset.
      
      check_map_access, that is used for PTR_TO_MAP_VALUE, can handle variable
      offsets just fine, so that BPF program can call a helper like this:
      
        some_helper(map_value_ptr + off, size);
      
      , where offset is unknown at load time, but is checked by program to be
      in a safe rage (off >= 0 && off + size < map_value_size).
      
      But it's not the case for check_stack_boundary, that is used for
      PTR_TO_STACK, and same code with pointer to stack is rejected by
      verifier:
      
        some_helper(stack_value_ptr + off, size);
      
      For example:
        0: (7a) *(u64 *)(r10 -16) = 0
        1: (7a) *(u64 *)(r10 -8) = 0
        2: (61) r2 = *(u32 *)(r1 +0)
        3: (57) r2 &= 4
        4: (17) r2 -= 16
        5: (0f) r2 += r10
        6: (18) r1 = 0xffff888111343a80
        8: (85) call bpf_map_lookup_elem#1
        invalid variable stack read R2 var_off=(0xfffffffffffffff0; 0x4)
      
      Add support for variable offset access to check_stack_boundary so that
      if offset is checked by program to be in a safe range it's accepted by
      verifier.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2011fccf
  9. 28 Mar, 2019 1 commit