1. 12 Dec, 2023 1 commit
  2. 11 Dec, 2023 2 commits
  3. 10 Dec, 2023 12 commits
  4. 09 Dec, 2023 6 commits
  5. 08 Dec, 2023 5 commits
    • Andrii Nakryiko's avatar
      Merge branch 'bpf-fix-accesses-to-uninit-stack-slots' · 4af20ab9
      Andrii Nakryiko authored
      Andrei Matei says:
      
      ====================
      bpf: fix accesses to uninit stack slots
      
      Fix two related issues issues around verifying stack accesses:
      1. accesses to uninitialized stack memory was allowed inconsistently
      2. the maximum stack depth needed for a program was not always
      maintained correctly
      
      The two issues are fixed together in one commit because the code for one
      affects the other.
      
      V4 to V5:
      - target bpf-next (Alexei)
      
      V3 to V4:
      - minor fixup to comment in patch 1 (Eduard)
      - C89-style in patch 3 (Andrii)
      
      V2 to V3:
      - address review comments from Andrii and Eduard
      - drop new verifier tests in favor of editing existing tests to check
        for stack depth
      - append a patch with a bit of cleanup coming out of the previous review
      ====================
      
      Link: https://lore.kernel.org/r/20231208032519.260451-1-andreimatei1@gmail.comSigned-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      4af20ab9
    • Andrei Matei's avatar
      bpf: Minor cleanup around stack bounds · 2929bfac
      Andrei Matei authored
      Push the rounding up of stack offsets into the function responsible for
      growing the stack, rather than relying on all the callers to do it.
      Uncertainty about whether the callers did it or not tripped up people in
      a previous review.
      Signed-off-by: default avatarAndrei Matei <andreimatei1@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20231208032519.260451-4-andreimatei1@gmail.com
      2929bfac
    • Andrei Matei's avatar
      bpf: Fix accesses to uninit stack slots · 6b4a64ba
      Andrei Matei authored
      Privileged programs are supposed to be able to read uninitialized stack
      memory (ever since 6715df8d) but, before this patch, these accesses
      were permitted inconsistently. In particular, accesses were permitted
      above state->allocated_stack, but not below it. In other words, if the
      stack was already "large enough", the access was permitted, but
      otherwise the access was rejected instead of being allowed to "grow the
      stack". This undesired rejection was happening in two places:
      - in check_stack_slot_within_bounds()
      - in check_stack_range_initialized()
      This patch arranges for these accesses to be permitted. A bunch of tests
      that were relying on the old rejection had to change; all of them were
      changed to add also run unprivileged, in which case the old behavior
      persists. One tests couldn't be updated - global_func16 - because it
      can't run unprivileged for other reasons.
      
      This patch also fixes the tracking of the stack size for variable-offset
      reads. This second fix is bundled in the same commit as the first one
      because they're inter-related. Before this patch, writes to the stack
      using registers containing a variable offset (as opposed to registers
      with fixed, known values) were not properly contributing to the
      function's needed stack size. As a result, it was possible for a program
      to verify, but then to attempt to read out-of-bounds data at runtime
      because a too small stack had been allocated for it.
      
      Each function tracks the size of the stack it needs in
      bpf_subprog_info.stack_depth, which is maintained by
      update_stack_depth(). For regular memory accesses, check_mem_access()
      was calling update_state_depth() but it was passing in only the fixed
      part of the offset register, ignoring the variable offset. This was
      incorrect; the minimum possible value of that register should be used
      instead.
      
      This tracking is now fixed by centralizing the tracking of stack size in
      grow_stack_state(), and by lifting the calls to grow_stack_state() to
      check_stack_access_within_bounds() as suggested by Andrii. The code is
      now simpler and more convincingly tracks the correct maximum stack size.
      check_stack_range_initialized() can now rely on enough stack having been
      allocated for the access; this helps with the fix for the first issue.
      
      A few tests were changed to also check the stack depth computation. The
      one that fails without this patch is verifier_var_off:stack_write_priv_vs_unpriv.
      
      Fixes: 01f810ac ("bpf: Allow variable-offset stack access")
      Reported-by: default avatarHao Sun <sunhao.th@gmail.com>
      Signed-off-by: default avatarAndrei Matei <andreimatei1@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231208032519.260451-3-andreimatei1@gmail.com
      
      Closes: https://lore.kernel.org/bpf/CABWLsev9g8UP_c3a=1qbuZUi20tGoUXoU07FPf-5FLvhOKOY+Q@mail.gmail.com/
      6b4a64ba
    • Andrei Matei's avatar
      bpf: Add some comments to stack representation · 92e1567e
      Andrei Matei authored
      Add comments to the datastructure tracking the stack state, as the
      mapping between each stack slot and where its state is stored is not
      entirely obvious.
      Signed-off-by: default avatarAndrei Matei <andreimatei1@gmail.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20231208032519.260451-2-andreimatei1@gmail.com
      92e1567e
    • David Vernet's avatar
      bpf: Load vmlinux btf for any struct_ops map · 8b7b0e5f
      David Vernet authored
      In libbpf, when determining whether we need to load vmlinux btf, we're
      currently (among other things) checking whether there is any struct_ops
      program present in the object. This works for most realistic struct_ops
      maps, as a struct_ops map is of course typically composed of one or more
      struct_ops programs. However, that technically need not be the case. A
      struct_ops interface could be defined which allows a map to be specified
      which one or more non-prog fields, and which provides default behavior
      if no struct_ops progs is actually provided otherwise. For sched_ext,
      for example, you technically only need to specify the name of the
      scheduler in the struct_ops map, with the core scheduler logic providing
      default behavior if no prog is actually specified.
      
      If we were to define and try to load such a struct_ops map, we would
      crash in libbpf when initializing it as obj->btf_vmlinux will be NULL:
      
      Reading symbols from minimal...
      (gdb) r
      Starting program: minimal_example
      [Thread debugging using libthread_db enabled]
      Using host libthread_db library "/usr/lib/libthread_db.so.1".
      
      Program received signal SIGSEGV, Segmentation fault.
      0x000055555558308c in btf__type_cnt (btf=0x0) at btf.c:612
      612             return btf->start_id + btf->nr_types;
      (gdb) bt
          type_name=0x5555555d99e3 "sched_ext_ops", kind=4) at btf.c:914
          kind=4) at btf.c:942
          type=0x7fffffffe558, type_id=0x7fffffffe548, ...
          data_member=0x7fffffffe568) at libbpf.c:948
          kern_btf=0x0) at libbpf.c:1017
          at libbpf.c:8059
      
      So as to account for such bare-bones struct_ops maps, let's update
      obj_needs_vmlinux_btf() to also iterate over an obj's maps and check
      whether any of them are struct_ops maps.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Reviewed-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Link: https://lore.kernel.org/bpf/20231208061704.400463-1-void@manifault.com
      8b7b0e5f
  6. 07 Dec, 2023 12 commits
  7. 06 Dec, 2023 2 commits
    • Andrii Nakryiko's avatar
      bpf: rename MAX_BPF_LINK_TYPE into __MAX_BPF_LINK_TYPE for consistency · 7065eefb
      Andrii Nakryiko authored
      To stay consistent with the naming pattern used for similar cases in BPF
      UAPI (__MAX_BPF_ATTACH_TYPE, etc), rename MAX_BPF_LINK_TYPE into
      __MAX_BPF_LINK_TYPE.
      
      Also similar to MAX_BPF_ATTACH_TYPE and MAX_BPF_REG, add:
      
        #define MAX_BPF_LINK_TYPE __MAX_BPF_LINK_TYPE
      
      Not all __MAX_xxx enums have such #define, so I'm not sure if we should
      add it or not, but I figured I'll start with a completely backwards
      compatible way, and we can drop that, if necessary.
      
      Also adjust a selftest that used MAX_BPF_LINK_TYPE enum.
      Suggested-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20231206190920.1651226-1-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7065eefb
    • Alexei Starovoitov's avatar
      Merge branch 'bpf-token-and-bpf-fs-based-delegation' · c35919dc
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      BPF token and BPF FS-based delegation
      
      This patch set introduces an ability to delegate a subset of BPF subsystem
      functionality from privileged system-wide daemon (e.g., systemd or any other
      container manager) through special mount options for userns-bound BPF FS to
      a *trusted* unprivileged application. Trust is the key here. This
      functionality is not about allowing unconditional unprivileged BPF usage.
      Establishing trust, though, is completely up to the discretion of respective
      privileged application that would create and mount a BPF FS instance with
      delegation enabled, as different production setups can and do achieve it
      through a combination of different means (signing, LSM, code reviews, etc),
      and it's undesirable and infeasible for kernel to enforce any particular way
      of validating trustworthiness of particular process.
      
      The main motivation for this work is a desire to enable containerized BPF
      applications to be used together with user namespaces. This is currently
      impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
      or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
      helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
      arbitrary memory, and it's impossible to ensure that they only read memory of
      processes belonging to any given namespace. This means that it's impossible to
      have a mechanically verifiable namespace-aware CAP_BPF capability, and as such
      another mechanism to allow safe usage of BPF functionality is necessary.BPF FS
      delegation mount options and BPF token derived from such BPF FS instance is
      such a mechanism. Kernel makes no assumption about what "trusted" constitutes
      in any particular case, and it's up to specific privileged applications and
      their surrounding infrastructure to decide that. What kernel provides is a set
      of APIs to setup and mount special BPF FS instanecs and derive BPF tokens from
      it. BPF FS and BPF token are both bound to its owning userns and in such a way
      are constrained inside intended container. Users can then pass BPF token FD to
      privileged bpf() syscall commands, like BPF map creation and BPF program
      loading, to perform such operations without having init userns privileged.
      
      This version incorporates feedback and suggestions ([3]) received on v3 of
      this patch set, and instead of allowing to create BPF tokens directly assuming
      capable(CAP_SYS_ADMIN), we instead enhance BPF FS to accept a few new
      delegation mount options. If these options are used and BPF FS itself is
      properly created, set up, and mounted inside the user namespaced container,
      user application is able to derive a BPF token object from BPF FS instance,
      and pass that token to bpf() syscall. As explained in patch #3, BPF token
      itself doesn't grant access to BPF functionality, but instead allows kernel to
      do namespaced capabilities checks (ns_capable() vs capable()) for CAP_BPF,
      CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, as applicable. So it forms one
      half of a puzzle and allows container managers and sys admins to have safe and
      flexible configuration options: determining which containers get delegation of
      BPF functionality through BPF FS, and then which applications within such
      containers are allowed to perform bpf() commands, based on namespaces
      capabilities.
      
      Previous attempt at addressing this very same problem ([0]) attempted to
      utilize authoritative LSM approach, but was conclusively rejected by upstream
      LSM maintainers. BPF token concept is not changing anything about LSM
      approach, but can be combined with LSM hooks for very fine-grained security
      policy. Some ideas about making BPF token more convenient to use with LSM (in
      particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
      2023 presentation ([1]). E.g., an ability to specify user-provided data
      (context), which in combination with BPF LSM would allow implementing a very
      dynamic and fine-granular custom security policies on top of BPF token. In the
      interest of minimizing API surface area and discussions this was relegated to
      follow up patches, as it's not essential to the fundamental concept of
      delegatable BPF token.
      
      It should be noted that BPF token is conceptually quite similar to the idea of
      /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
      difference is the idea of using virtual anon_inode file to hold BPF token and
      allowing multiple independent instances of them, each (potentially) with its
      own set of restrictions. And also, crucially, BPF token approach is not using
      any special stateful task-scoped flags. Instead, bpf() syscall accepts
      token_fd parameters explicitly for each relevant BPF command. This addresses
      main concerns brought up during the /dev/bpf discussion, and fits better with
      overall BPF subsystem design.
      
      This patch set adds a basic minimum of functionality to make BPF token idea
      useful and to discuss API and functionality. Currently only low-level libbpf
      APIs support creating and passing BPF token around, allowing to test kernel
      functionality, but for the most part is not sufficient for real-world
      applications, which typically use high-level libbpf APIs based on `struct
      bpf_object` type. This was done with the intent to limit the size of patch set
      and concentrate on mostly kernel-side changes. All the necessary plumbing for
      libbpf will be sent as a separate follow up patch set kernel support makes it
      upstream.
      
      Another part that should happen once kernel-side BPF token is established, is
      a set of conventions between applications (e.g., systemd), tools (e.g.,
      bpftool), and libraries (e.g., libbpf) on exposing delegatable BPF FS
      instance(s) at well-defined locations to allow applications take advantage of
      this in automatic fashion without explicit code changes on BPF application's
      side. But I'd like to postpone this discussion to after BPF token concept
      lands.
      
        [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
        [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
        [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
        [3] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
      
      v11->v12:
        - enforce exact userns match in bpf_token_capable() and
          bpf_token_allow_cmd() checks, for added strictness (Christian);
      v10->v11:
        - fix BPF FS root check to disallow using bind-mounted subdirectory of BPF
          FS instance (Christian);
        - further restrict BPF_TOKEN_CREATE command to be executed from inside
          exactly the same user namespace as the one used to create BPF FS instance
          (Christian);
      v9->v10:
        - slight adjustments in LSM parts (Paul);
        - setting delegate_xxx  options require capable(CAP_SYS_ADMIN) (Christian);
        - simplify BPF_TOKEN_CREATE UAPI by accepting BPF FS FD directly (Christian);
      v8->v9:
        - fix issue in selftests due to sys/mount.h header (Jiri);
        - fix warning in doc comments in LSM hooks (kernel test robot);
      v7->v8:
        - add bpf_token_allow_cmd and bpf_token_capable hooks (Paul);
        - inline bpf_token_alloc() into bpf_token_create() to prevent accidental
          divergence with security_bpf_token_create() hook (Paul);
      v6->v7:
        - separate patches to refactor bpf_prog_alloc/bpf_map_alloc LSM hooks, as
          discussed with Paul, and now they also accept struct bpf_token;
        - added bpf_token_create/bpf_token_free to allow LSMs (SELinux,
          specifically) to set up security LSM blob (Paul);
        - last patch also wires bpf_security_struct setup by SELinux, similar to how
          it's done for BPF map/prog, though I'm not sure if that's enough, so worst
          case it's easy to drop this patch if more full fledged SELinux
          implementation will be done separately;
        - small fixes for issues caught by code reviews (Jiri, Hou);
        - fix for test_maps test that doesn't use LIBBPF_OPTS() macro (CI);
      v5->v6:
        - fix possible use of uninitialized variable in selftests (CI);
        - don't use anon_inode, instead create one from BPF FS instance (Christian);
        - don't store bpf_token inside struct bpf_map, instead pass it explicitly to
          map_check_btf(). We do store bpf_token inside prog->aux, because it's used
          during verification and even can be checked during attach time for some
          program types;
        - LSM hooks are left intact pending the conclusion of discussion with Paul
          Moore; I'd prefer to do LSM-related changes as a follow up patch set
          anyways;
      v4->v5:
        - add pre-patch unifying CAP_NET_ADMIN handling inside kernel/bpf/syscall.c
          (Paul Moore);
        - fix build warnings and errors in selftests and kernel, detected by CI and
          kernel test robot;
      v3->v4:
        - add delegation mount options to BPF FS;
        - BPF token is derived from the instance of BPF FS and associates itself
          with BPF FS' owning userns;
        - BPF token doesn't grant BPF functionality directly, it just turns
          capable() checks into ns_capable() checks within BPF FS' owning user;
        - BPF token cannot be pinned;
      v2->v3:
        - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
          BPF_OBJ_PIN for BPF token;
      v1->v2:
        - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
        - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
      ====================
      
      Link: https://lore.kernel.org/r/20231130185229.2688956-1-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c35919dc