1. 21 Jan, 2022 6 commits
    • libbpf: streamline low-level XDP APIs · c359821a
      Andrii Nakryiko authored
      Introduce 4 new netlink-based XDP APIs for attaching, detaching, and
      querying XDP programs:
        - bpf_xdp_attach;
        - bpf_xdp_detach;
        - bpf_xdp_query;
        - bpf_xdp_query_id.
      
      These APIs replace bpf_set_link_xdp_fd, bpf_set_link_xdp_fd_opts,
      bpf_get_link_xdp_id, and bpf_get_link_xdp_info APIs ([0]). The latter
      don't follow a consistent naming pattern and some of them use
      non-extensible approaches (e.g., struct xdp_link_info which can't be
      modified without breaking libbpf ABI).
      
      The approach I took with these low-level XDP APIs is similar to what we
      did with low-level TC APIs. There is a nice duality of bpf_tc_attach vs
      bpf_xdp_attach, and so on. I left bpf_xdp_attach() to support detaching
      when -1 is specified for prog_fd for generality and convenience, but
      bpf_xdp_detach() is preferred due to clearer naming and associated
      semantics. Both bpf_xdp_attach() and bpf_xdp_detach() accept the same
      opts struct allowing to specify expected old_prog_fd.
      
      While doing the refactoring, I noticed that old APIs require users to
      specify opts with old_fd == -1 to declare "don't care about already
      attached XDP prog fd" condition. Otherwise, FD 0 is assumed, which is
      essentially never an intended behavior. So I made this behavior
      consistent with other kernel and libbpf APIs, in which zero FD means "no
      FD". This seems to be more in line with the latest thinking in BPF land
      and should cause less user confusion, hopefully.
      
      For querying, I left two APIs, both more generic bpf_xdp_query()
      allowing to query multiple IDs and attach mode, but also
      a specialization of it, bpf_xdp_query_id(), which returns only requested
      prog_id. Uses of prog_id returning bpf_get_link_xdp_id() were so
      prevalent across selftests and samples, that it seemed a very common use
      case and using bpf_xdp_query() for doing it felt very cumbersome with
      a highly branched if/else chain based on flags and attach mode.
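As an illustration of the branching that bpf_xdp_query_id() is meant to hide, here is a minimal, self-contained sketch. It is not libbpf code: struct xdp_query_result, the MODE_* and ATTACHED_* constants, and pick_prog_id() are hypothetical stand-ins defined locally, and the selection order shown is an assumption about how such a chain could look.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the XDP mode flags a caller may request. */
enum { MODE_SKB = 1 << 1, MODE_DRV = 1 << 2, MODE_HW = 1 << 3 };
#define MODE_MASK (MODE_SKB | MODE_DRV | MODE_HW)

/* Hypothetical stand-ins for the attach mode reported by a generic query. */
enum { ATTACHED_NONE, ATTACHED_DRV, ATTACHED_SKB, ATTACHED_HW, ATTACHED_MULTI };

/* Hypothetical result of a generic query: one ID per mode plus the mode. */
struct xdp_query_result {
	uint32_t prog_id;     /* meaningful when a single program is attached */
	uint32_t drv_prog_id;
	uint32_t hw_prog_id;
	uint32_t skb_prog_id;
	uint8_t attach_mode;
};

/* Pick the one program ID the caller asked for, based on flags and mode. */
static uint32_t pick_prog_id(const struct xdp_query_result *res, int flags)
{
	assert(res);
	flags &= MODE_MASK;

	/* No explicit mode requested and only one program attached. */
	if (res->attach_mode != ATTACHED_MULTI && !flags)
		return res->prog_id;
	if (flags & MODE_DRV)
		return res->drv_prog_id;
	if (flags & MODE_HW)
		return res->hw_prog_id;
	if (flags & MODE_SKB)
		return res->skb_prog_id;
	return 0; /* nothing matches the request */
}
```

A caller that only needs one ID would otherwise have to repeat this chain at every call site, which is the cumbersome pattern the specialized query is meant to remove.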
      
      Old APIs are scheduled for deprecation in libbpf 0.8 release.
      
        [0] Closes: https://github.com/libbpf/libbpf/issues/309

      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/r/20220120061422.2710637-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'libbpf: deprecate legacy BPF map definitions' · 1713e33b
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      
      Officially deprecate legacy BPF map definitions in libbpf. They've been slated
      for deprecation for a while in favor of more powerful BTF-defined map
      definitions and this patch set adds warnings and a way to enforce this in
      libbpf through LIBBPF_STRICT_MAP_DEFINITIONS strict mode flag.
      
      Selftests are fixed up and updated, BPF documentation is updated, bpftool's
      strict mode usage is adjusted to avoid breaking users unnecessarily.
      
      v1->v2:
        - replace missed bpf_map_def case in Documentation/bpf/btf.rst (Alexei).
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • docs/bpf: update BPF map definition example · 96c85308
      Andrii Nakryiko authored
      Use BTF-defined map definition in the documentation example.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220120060529.1890907-5-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: deprecate legacy BPF map definitions · 93b8952d
      Andrii Nakryiko authored
      Enact deprecation of legacy BPF map definition in SEC("maps") ([0]). For
      the definitions themselves introduce LIBBPF_STRICT_MAP_DEFINITIONS flag
      for libbpf strict mode. If it is set, error out on any struct
      bpf_map_def-based map definition. If not set, libbpf will print out
      a warning for each legacy BPF map to raise awareness that it goes away.
      
      For any use of BPF_ANNOTATE_KV_PAIR() macro providing a legacy way to
      associate BTF key/value type information with legacy BPF map definition,
      warn through libbpf's pr_warn() error message (but don't fail BPF object
      open).
      
      BPF-side struct bpf_map_def is marked as deprecated. User-space struct
      bpf_map_def has to be used internally in libbpf, so it is left
      untouched. It should be enough for bpf_map__def() to be marked
      deprecated to raise awareness that it goes away.
      
      bpftool is an interesting case that utilizes libbpf to open BPF ELF
      object to generate skeleton. As such, even though bpftool itself uses
      full on strict libbpf mode (LIBBPF_STRICT_ALL), it has to relax it a bit
      for BPF map definition handling to minimize unnecessary disruptions. So
      opt out of LIBBPF_STRICT_MAP_DEFINITIONS for bpftool. User's code that
      will later use generated skeleton will make its own decision whether to
      enforce LIBBPF_STRICT_MAP_DEFINITIONS or not.
      
      There are a few tests in selftests/bpf that are consciously using legacy
      BPF map definitions to test libbpf functionality. For those, temporarily
      opt out of LIBBPF_STRICT_MAP_DEFINITIONS mode for the duration of those
      tests.
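For readers unfamiliar with the two styles, the sketch below contrasts a legacy definition with a BTF-defined one. The __uint, __type, and SEC macros are local copies of the helpers that BPF programs normally take from libbpf's bpf_helpers.h, while struct legacy_map_def, MAP_TYPE_HASH, and the map names are hypothetical stand-ins so that the snippet compiles on its own with GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of helper macros normally provided by bpf_helpers.h;
 * defined here only so that this sketch builds without libbpf. */
#define __uint(name, val) int (*name)[val]
#define __type(name, val) __typeof__(val) *name
#if defined(__ELF__)
#define SEC(name) __attribute__((section(name), used))
#else
#define SEC(name) /* section placement only matters for ELF objects */
#endif

#define MAP_TYPE_HASH 1 /* hypothetical stand-in for BPF_MAP_TYPE_HASH */

/* Hypothetical mirror of the legacy fixed-layout definition struct. */
struct legacy_map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
	unsigned int map_flags;
};

/* Legacy style: plain values in a struct placed in the "maps" section. */
struct legacy_map_def legacy_counts SEC("maps") = {
	.type = MAP_TYPE_HASH,
	.key_size = sizeof(uint32_t),
	.value_size = sizeof(uint64_t),
	.max_entries = 1024,
};

/* BTF-defined style: attributes are encoded in the member types. */
struct {
	__uint(type, MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, uint32_t);
	__type(value, uint64_t);
} counts SEC(".maps");
```

In the BTF-defined form the integer attributes live in the type itself (as array dimensions behind a pointer), so they can be recovered from type information rather than from a fixed struct layout.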
      
        [0] Closes: https://github.com/libbpf/libbpf/issues/272

      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220120060529.1890907-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: convert remaining legacy map definitions · ccc3f569
      Andrii Nakryiko authored
      Convert the few remaining legacy BPF map definitions to BTF-defined ones.
      For the remaining two bpf_map_def-based legacy definitions that we want
      to keep for testing purposes until libbpf 1.0 release, guard them in
      pragma to suppress deprecation warnings which will be added in libbpf in
      the next commit.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220120060529.1890907-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: fail build on compilation warning · 32b34294
      Andrii Nakryiko authored
      It's very easy to miss compilation warnings without -Werror, which is
      not set for selftests. libbpf and bpftool are already strict about this,
      so make selftests/bpf also treat compilation warnings as errors to catch
      such regressions early.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220120060529.1890907-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 20 Jan, 2022 5 commits
  3. 19 Jan, 2022 12 commits
  4. 18 Jan, 2022 17 commits
    • Merge branch 'bpf: Batching iter for AF_UNIX sockets.' · 712d4793
      Alexei Starovoitov authored
      Kuniyuki Iwashima says:
      
      ====================
      
      Last year the commit afd20b92 ("af_unix: Replace the big lock with
      small locks.") landed on bpf-next.  Now we can use a batching algorithm
      for the AF_UNIX bpf iter, as the TCP bpf iter already does.
      
      Changelog:
      - Add the 1st patch.
      - Call unix_get_first() in .start()/.next() to always acquire a lock in
        each iteration in the 2nd patch.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftest/bpf: Fix a stale comment. · a796966b
      Kuniyuki Iwashima authored
      The commit b8a58aa6 ("af_unix: Cut unix_validate_addr() out of
      unix_mkname().") moved the bound test part into unix_validate_addr().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Link: https://lore.kernel.org/r/20220113002849.4384-6-kuniyu@amazon.co.jp
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftest/bpf: Test batching and bpf_(get|set)sockopt in bpf unix iter. · 7ff8985c
      Kuniyuki Iwashima authored
      This patch adds a test for the batching and bpf_(get|set)sockopt in bpf
      unix iter.
      
      It does the following.
      
        1. Create an abstract UNIX domain socket
        2. Call bpf_setsockopt()
        3. Call bpf_getsockopt() and save the value
        4. Call setsockopt()
        5. Call getsockopt() and save the value
        6. Compare the saved values
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Link: https://lore.kernel.org/r/20220113002849.4384-5-kuniyu@amazon.co.jp
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Support bpf_(get|set)sockopt() in bpf unix iter. · eb7d8f1d
      Kuniyuki Iwashima authored
      This patch makes bpf_(get|set)sockopt() available when iterating AF_UNIX
      sockets.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Link: https://lore.kernel.org/r/20220113002849.4384-4-kuniyu@amazon.co.jp
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: af_unix: Use batching algorithm in bpf unix iter. · 855d8e77
      Kuniyuki Iwashima authored
      The commit 04c7820b ("bpf: tcp: Bpf iter batching and lock_sock")
      introduces the batching algorithm to iterate TCP sockets with more
      consistency.
      
      This patch uses the same algorithm to iterate AF_UNIX sockets.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Link: https://lore.kernel.org/r/20220113002849.4384-3-kuniyu@amazon.co.jp
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • af_unix: Refactor unix_next_socket(). · 4408d55a
      Kuniyuki Iwashima authored
      Currently, unix_next_socket() is overloaded depending on the 2nd argument.
      If it is NULL, unix_next_socket() returns the first socket in the hash.  If
      not NULL, it returns the next socket in the same hash list or the first
      socket in the next non-empty hash list.
      
      This patch refactors unix_next_socket() into two functions unix_get_first()
      and unix_get_next().  unix_get_first() newly acquires a lock and returns
      the first socket in the list.  unix_get_next() returns the next socket in a
      list or releases a lock and falls back to unix_get_first().
      
      In the following patch, bpf iter holds entire sockets in a list and always
      releases the lock before .show().  It always calls unix_get_first() to
      acquire a lock in each iteration.  So, this patch makes the change easier
      to follow.
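The split described above can be modelled in user space. This is a simplified sketch rather than the kernel code: struct node, struct table, NBUCKETS, get_first(), and get_next() are hypothetical names, and the per-bucket locking is only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 4 /* hypothetical; chosen small for the example */

struct node {
	int id;
	struct node *next;
};

struct table {
	struct node *bucket[NBUCKETS];
};

/* Return the first node at or after bucket *pos, leaving *pos at the
 * bucket it was found in; return NULL when no bucket has entries left. */
static struct node *get_first(struct table *t, size_t *pos)
{
	for (; *pos < NBUCKETS; (*pos)++) {
		/* the real code would take this bucket's lock here ... */
		if (t->bucket[*pos])
			return t->bucket[*pos];
		/* ... and drop it again if the bucket turned out empty */
	}
	return NULL;
}

/* Return the node after cur in the same bucket, or fall back to
 * get_first() on the following bucket once this one is exhausted. */
static struct node *get_next(struct table *t, struct node *cur, size_t *pos)
{
	assert(cur);
	if (cur->next)
		return cur->next;
	/* bucket exhausted: the lock would be released before moving on */
	(*pos)++;
	return get_first(t, pos);
}
```

Keeping the "find first" and "advance" steps separate means an iterator that drops the lock between iterations can always re-enter through get_first().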
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Link: https://lore.kernel.org/r/20220113002849.4384-2-kuniyu@amazon.co.jp
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'Introduce unstable CT lookup helpers' · 2a1aff60
      Alexei Starovoitov authored
      Kumar Kartikeya says:
      
      ====================
      
      This series adds unstable conntrack lookup helpers using BPF kfunc support.  The
      patch adding the lookup helper is based off of Maxim's recent patch to aid in
      rebasing their series on top of this, all adjusted to work with module kfuncs [0].
      
        [0]: https://lore.kernel.org/bpf/20211019144655.3483197-8-maximmi@nvidia.com
      
      To enable returning a reference to struct nf_conn, the verifier is extended to
      support reference tracking for PTR_TO_BTF_ID, and kfunc is extended with support
      for working as acquire/release functions, similar to existing BPF helpers. kfunc
      returning pointer (limited to PTR_TO_BTF_ID in the kernel) can also return a
      PTR_TO_BTF_ID_OR_NULL now, typically needed when acquiring a resource can fail.
      kfunc can also receive PTR_TO_CTX and PTR_TO_MEM (with some limitations) as
      arguments now. There is also support for passing a mem, len pair as argument
      to kfunc now. In such cases, passing pointer to unsized type (void) is also
      permitted.
      
      Please see individual commits for details.
      
      Changelog:
      ----------
      v7 -> v8:
      v7: https://lore.kernel.org/bpf/20220111180428.931466-1-memxor@gmail.com
      
       * Move enum btf_kfunc_hook to btf.c (Alexei)
       * Drop verbose log for unlikely failure case in __find_kfunc_desc_btf (Alexei)
       * Remove unnecessary barrier in register_btf_kfunc_id_set (Alexei)
       * Switch macro in bpf_nf test to __always_inline function (Alexei)
      
      v6 -> v7:
      v6: https://lore.kernel.org/bpf/20220102162115.1506833-1-memxor@gmail.com
      
       * Drop try_module_get_live patch, use flag in btf_module struct (Alexei)
       * Add comments and expand commit message detailing why we have to concatenate
         and sort vmlinux kfunc BTF ID sets (Alexei)
       * Use bpf_testmod for testing btf_try_get_module race (Alexei)
       * Use bpf_prog_type for both btf_kfunc_id_set_contains and
         register_btf_kfunc_id_set calls (Alexei)
       * In case of module set registration, directly assign set (Alexei)
       * Add CONFIG_USERFAULTFD=y to selftest config
       * Fix other nits
      
      v5 -> v6:
      v5: https://lore.kernel.org/bpf/20211230023705.3860970-1-memxor@gmail.com
      
       * Fix for a bug in btf_try_get_module leading to use-after-free
       * Drop *kallsyms_on_each_symbol loop, reinstate register_btf_kfunc_id_set (Alexei)
       * btf_free_kfunc_set_tab now takes struct btf, and handles resetting tab to NULL
       * Check return value btf_name_by_offset for param_name
       * Instead of using tmp_set, use btf->kfunc_set_tab directly, and simplify cleanup
      
      v4 -> v5:
      v4: https://lore.kernel.org/bpf/20211217015031.1278167-1-memxor@gmail.com
      
       * Move nf_conntrack helpers code to its own separate file (Toke, Pablo)
       * Remove verifier callbacks, put btf_id_sets in struct btf (Alexei)
        * Convert the in-kernel users away from the old API
       * Change len__ prefix convention to __sz suffix (Alexei)
       * Drop parent_ref_obj_id patch (Alexei)
      
      v3 -> v4:
      v3: https://lore.kernel.org/bpf/20211210130230.4128676-1-memxor@gmail.com
      
       * Guard unstable CT helpers with CONFIG_DEBUG_INFO_BTF_MODULES
       * Move addition of prog_test test kfuncs to selftest commit
       * Move negative kfunc tests to test_verifier suite
       * Limit struct nesting depth to 4, which should be enough for now
      
      v2 -> v3:
      v2: https://lore.kernel.org/bpf/20211209170929.3485242-1-memxor@gmail.com
      
       * Fix build error for !CONFIG_BPF_SYSCALL (Patchwork)
      
      RFC v1 -> v2:
      v1: https://lore.kernel.org/bpf/20211030144609.263572-1-memxor@gmail.com
      
       * Limit PTR_TO_MEM support to pointer to scalar, or struct with scalars (Alexei)
       * Use btf_id_set for checking acquire, release, ret type null (Alexei)
       * Introduce opts struct for CT helpers, move int err parameter to it
       * Add l4proto as parameter to CT helper's opts, remove separate tcp/udp helpers
       * Add support for mem, len argument pair to kfunc
       * Allow void * as pointer type for mem, len argument pair
       * Extend selftests to cover new additions to kfuncs
       * Copy ref_obj_id to PTR_TO_BTF_ID dst_reg on btf_struct_access, test it
       * Fix other misc nits, bugs, and expand commit messages
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add test for race in btf_try_get_module · 46565696
      Kumar Kartikeya Dwivedi authored
      This adds a complete test case to ensure we never take references to
      modules not in MODULE_STATE_LIVE, which can lead to UAF, and it also
      ensures we never access btf->kfunc_set_tab in an inconsistent state.
      
      The test uses userfaultfd to artificially widen the race.
      
      When run on an unpatched kernel, it leads to the following splat:
      
      [root@(none) bpf]# ./test_progs -t bpf_mod_race/ksym
      [   55.498171] BUG: unable to handle page fault for address: fffffbfff802548b
      [   55.499206] #PF: supervisor read access in kernel mode
      [   55.499855] #PF: error_code(0x0000) - not-present page
      [   55.500555] PGD a4fa9067 P4D a4fa9067 PUD a4fa5067 PMD 1b44067 PTE 0
      [   55.501499] Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
      [   55.502195] CPU: 0 PID: 83 Comm: kworker/0:2 Tainted: G           OE     5.16.0-rc4+ #151
      [   55.503388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.15.0-1 04/01/2014
      [   55.504777] Workqueue: events bpf_prog_free_deferred
      [   55.505563] RIP: 0010:kasan_check_range+0x184/0x1d0
      [   55.509140] RSP: 0018:ffff88800560fcf0 EFLAGS: 00010282
      [   55.509977] RAX: fffffbfff802548b RBX: fffffbfff802548c RCX: ffffffff9337b6ba
      [   55.511096] RDX: fffffbfff802548c RSI: 0000000000000004 RDI: ffffffffc012a458
      [   55.512143] RBP: fffffbfff802548b R08: 0000000000000001 R09: ffffffffc012a45b
      [   55.513228] R10: fffffbfff802548b R11: 0000000000000001 R12: ffff888001b5f598
      [   55.514332] R13: ffff888004f49ac8 R14: 0000000000000000 R15: ffff888092449400
      [   55.515418] FS:  0000000000000000(0000) GS:ffff888092400000(0000) knlGS:0000000000000000
      [   55.516705] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   55.517560] CR2: fffffbfff802548b CR3: 0000000007c10006 CR4: 0000000000770ef0
      [   55.518672] PKRU: 55555554
      [   55.519022] Call Trace:
      [   55.519483]  <TASK>
      [   55.519884]  module_put.part.0+0x2a/0x180
      [   55.520642]  bpf_prog_free_deferred+0x129/0x2e0
      [   55.521478]  process_one_work+0x4fa/0x9e0
      [   55.522122]  ? pwq_dec_nr_in_flight+0x100/0x100
      [   55.522878]  ? rwlock_bug.part.0+0x60/0x60
      [   55.523551]  worker_thread+0x2eb/0x700
      [   55.524176]  ? __kthread_parkme+0xd8/0xf0
      [   55.524853]  ? process_one_work+0x9e0/0x9e0
      [   55.525544]  kthread+0x23a/0x270
      [   55.526088]  ? set_kthread_struct+0x80/0x80
      [   55.526798]  ret_from_fork+0x1f/0x30
      [   55.527413]  </TASK>
      [   55.527813] Modules linked in: bpf_testmod(OE) [last unloaded: bpf_testmod]
      [   55.530846] CR2: fffffbfff802548b
      [   55.531341] ---[ end trace 1af41803c054ad6d ]---
      [   55.532136] RIP: 0010:kasan_check_range+0x184/0x1d0
      [   55.535887] RSP: 0018:ffff88800560fcf0 EFLAGS: 00010282
      [   55.536711] RAX: fffffbfff802548b RBX: fffffbfff802548c RCX: ffffffff9337b6ba
      [   55.537821] RDX: fffffbfff802548c RSI: 0000000000000004 RDI: ffffffffc012a458
      [   55.538899] RBP: fffffbfff802548b R08: 0000000000000001 R09: ffffffffc012a45b
      [   55.539928] R10: fffffbfff802548b R11: 0000000000000001 R12: ffff888001b5f598
      [   55.541021] R13: ffff888004f49ac8 R14: 0000000000000000 R15: ffff888092449400
      [   55.542108] FS:  0000000000000000(0000) GS:ffff888092400000(0000) knlGS:0000000000000000
      [   55.543260] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   55.544136] CR2: fffffbfff802548b CR3: 0000000007c10006 CR4: 0000000000770ef0
      [   55.545317] PKRU: 55555554
      [   55.545671] note: kworker/0:2[83] exited with preempt_count 1
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-11-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Extend kfunc selftests · c1ff181f
      Kumar Kartikeya Dwivedi authored
      Use the prog_test kfuncs to test the referenced PTR_TO_BTF_ID kfunc
      support, and PTR_TO_CTX, PTR_TO_MEM argument passing support. Also
      test the various failure cases for invalid kfunc prototypes.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-10-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add test_verifier support to fixup kfunc call insns · 0201b807
      Kumar Kartikeya Dwivedi authored
      This allows us to add tests (esp. negative tests) where we only want to
      ensure the program doesn't pass through the verifier, and also verify
      the error. The next commit will add the tests making use of this.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-9-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add test for unstable CT lookup API · 87091063
      Kumar Kartikeya Dwivedi authored
      This tests that we return errors as documented, and also that the kfunc
      calls work from both XDP and TC hooks.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-8-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPF · b4c2b959
      Kumar Kartikeya Dwivedi authored
      This change adds conntrack lookup helpers using the unstable kfunc call
      interface for the XDP and TC-BPF hooks. The primary usecase is
      implementing a synproxy in XDP, see Maxim's patchset [0].
      
      Export get_net_ns_by_id as nf_conntrack_bpf.c needs to call it.
      
      This object is only built when CONFIG_DEBUG_INFO_BTF_MODULES is enabled.
      
        [0]: https://lore.kernel.org/bpf/20211019144655.3483197-1-maximmi@nvidia.com

      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-7-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add reference tracking support to kfunc · 5c073f26
      Kumar Kartikeya Dwivedi authored
      This patch adds verifier support for PTR_TO_BTF_ID return type of kfunc
      to be a reference, by reusing acquire_reference_state/release_reference
      support for existing in-kernel bpf helpers.
      
      We make use of the three kfunc types:
      
      - BTF_KFUNC_TYPE_ACQUIRE
        Return true if kfunc_btf_id is an acquire kfunc.  This will
        acquire_reference_state for the returned PTR_TO_BTF_ID (this is the
        only allowed return value). Note that an acquire kfunc must always return a
        PTR_TO_BTF_ID{_OR_NULL}, otherwise the program is rejected.
      
      - BTF_KFUNC_TYPE_RELEASE
        Return true if kfunc_btf_id is a release kfunc.  This will release the
        reference to the passed in PTR_TO_BTF_ID which has a reference state
        (from earlier acquire kfunc).
        The btf_check_func_arg_match returns the regno (of argument register,
        hence > 0) if the kfunc is a release kfunc, and a proper referenced
        PTR_TO_BTF_ID is being passed to it.
        This is similar to how helper call check uses bpf_call_arg_meta to
        store the ref_obj_id that is later used to release the reference.
        Similar to in-kernel helper, we only allow passing one referenced
        PTR_TO_BTF_ID as an argument. It can also be passed in to normal
        kfunc, but in case of release kfunc there must always be one
        PTR_TO_BTF_ID argument that is referenced.
      
      - BTF_KFUNC_TYPE_RET_NULL
        For kfunc returning PTR_TO_BTF_ID, tells if it can be NULL, hence
        force caller to mark the pointer not null (using check) before
        accessing it. Note that taking into account the case fixed by commit
        93c230e3 ("bpf: Enforce id generation for all may-be-null register type")
        we assign a non-zero id for mark_ptr_or_null_reg logic. Later, if more
        return types are supported by kfunc, which have a _OR_NULL variant, it
        might be better to move this id generation under a common
        reg_type_may_be_null check, similar to the case in the commit.
      
      Referenced PTR_TO_BTF_ID is currently only limited to kfunc, but can be
      extended in the future to other BPF helpers as well.  For now, we can
      rely on the btf_struct_ids_match check to ensure we get the pointer to
      the expected struct type. In the future, care needs to be taken to avoid
      ambiguity for reference PTR_TO_BTF_ID passed to release function, in
      case multiple candidates can release same BTF ID.
      
      e.g. there might be two release kfuncs (or kfunc and helper):
      
      foo(struct abc *p);
      bar(struct abc *p);
      
      ... such that both release a PTR_TO_BTF_ID with btf_id of struct abc. In
      this case we would need to track the acquire function corresponding to
      the release function to avoid type confusion, and store this information
      in the register state so that an incorrect program can be rejected. This
      is not a problem right now, hence it is left as an exercise for the
      future patch introducing such a case in the kernel.
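The acquire/release bookkeeping can be pictured with a small user-space model. This is only a sketch of the idea, not the verifier's implementation: struct ref_tracker, MAX_REFS, ref_acquire(), ref_release(), and program_ok_at_exit() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_REFS 8 /* hypothetical cap for this model */

/* Set of reference IDs that are currently held by the "program". */
struct ref_tracker {
	int ids[MAX_REFS];
	int cnt;
	int next_id; /* IDs start at 1 so that 0 can mean "not a reference" */
};

/* Model of an acquire call: hand out a fresh ID, or -1 when the set is full. */
static int ref_acquire(struct ref_tracker *t)
{
	assert(t);
	if (t->cnt == MAX_REFS)
		return -1;
	t->ids[t->cnt++] = ++t->next_id;
	return t->next_id;
}

/* Model of a release call: succeeds only for an ID that is still held. */
static int ref_release(struct ref_tracker *t, int id)
{
	assert(t);
	for (int i = 0; i < t->cnt; i++) {
		if (t->ids[i] != id)
			continue;
		t->ids[i] = t->ids[--t->cnt]; /* swap-remove the entry */
		return 0;
	}
	return -1; /* double release or unknown reference */
}

/* A program is acceptable only if nothing is still held at exit. */
static bool program_ok_at_exit(const struct ref_tracker *t)
{
	return t->cnt == 0;
}
```

In this model a leaked reference and a double release are both detectable, which mirrors the two kinds of mistakes that reference tracking is meant to reject.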
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-6-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Introduce mem, size argument pair support for kfunc · d583691c
      Kumar Kartikeya Dwivedi authored
      BPF helpers can associate two adjacent arguments together to pass memory
      of certain size, using ARG_PTR_TO_MEM and ARG_CONST_SIZE arguments.
      Since we don't use bpf_func_proto for kfunc, we need to leverage BTF to
      implement similar support.
      
      The ARG_CONST_SIZE processing for helpers is refactored into a common
      check_mem_size_reg helper that is shared with kfunc as well. kfunc
      ptr_to_mem support follows logic similar to global functions, where
      verification is done as if pointer is not null, even when it may be
      null.
      
      This leads to a simple to follow rule for writing kfunc: always check
      the argument pointer for NULL, except when it is PTR_TO_CTX. Note that the
      PTR_TO_CTX case is only safe when the helper expecting a pointer to
      program ctx is not exposed to other programs where same struct is not
      ctx type. In that case, the type check will fall through to other cases
      and would permit passing other types of pointers, possibly NULL at
      runtime.
      
      Currently, we require the size argument to be suffixed with "__sz" in
      the parameter name. This information is then recorded in kernel BTF and
      verified during function argument checking. In the future we can use BTF
      tagging instead, and modify the kernel function definitions. This will
      be a purely kernel-side change.
      
      This allows us to have some form of backwards compatibility for
      structures that are passed in to the kernel function with their size,
      and allow variable length structures to be passed in if they are
      accompanied by a size parameter.
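The naming convention can be illustrated with a short suffix check. This is a sketch under stated assumptions, not the kernel's code: is_size_param() is a hypothetical helper, and the parameter names used with it are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true when a parameter name ends in "__sz", the suffix that
 * marks the size half of a (mem, size) argument pair. */
static bool is_size_param(const char *name)
{
	static const char suffix[] = "__sz";
	size_t sfx_len = sizeof(suffix) - 1;
	size_t len;

	assert(name);
	len = strlen(name);
	if (len < sfx_len)
		return false;
	/* compare only the tail of the name against the suffix */
	return strcmp(name + len - sfx_len, suffix) == 0;
}
```

With such a check, a hypothetical prototype like `f(void *opts, unsigned int opts__sz)` would have its second parameter treated as the size of the memory that the first one points to.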
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-5-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Remove check_kfunc_call callback and old kfunc BTF ID API · b202d844
      Kumar Kartikeya Dwivedi authored
      Completely remove the old code for check_kfunc_call to help it work
      with modules, and also the callback itself.
      
      The previous commit adds infrastructure to register all sets and put
      them in vmlinux or module BTF, and concatenates all related sets
      organized by the hook and the type. Once populated, these sets remain
      immutable for the lifetime of the struct btf.
      
      Also, since we don't need the 'owner' module anywhere when doing
      check_kfunc_call, drop the 'btf_modp' module parameter from
      find_kfunc_desc_btf.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-4-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Populate kfunc BTF ID sets in struct btf · dee872e1
      Kumar Kartikeya Dwivedi authored
      This patch prepares the kernel to support putting all kinds of kfunc BTF
      ID sets in the struct btf itself. The various kernel subsystems will
      make register_btf_kfunc_id_set call in the initcalls (for built-in code
      and modules).
      
      The 'hook' is one of the many program types, e.g. XDP and TC/SCHED_CLS,
      STRUCT_OPS, and 'types' are check (allowed or not), acquire, release,
      and ret_null (with PTR_TO_BTF_ID_OR_NULL return type).
      
      A maximum of BTF_KFUNC_SET_MAX_CNT (32) kfunc BTF IDs are permitted in a
      set of certain hook and type for vmlinux sets, since they are allocated
      on demand, and otherwise set as NULL. Module sets can only be registered
      once per hook and type, hence they are directly assigned.
      
      A new btf_kfunc_id_set_contains function is exposed for use in verifier,
      this new method is faster than the existing list searching method, and
      is also automatic. It also lets other code not care whether the set is
      unallocated or not.
      
      Note that module code can only make a single register_btf_kfunc_id_set call
      per hook. This is why sorting is only done for in-kernel vmlinux sets,
      because there might be multiple sets for the same hook and type that
      must be concatenated, hence sorting them is required to ensure bsearch
      in btf_id_set_contains continues to work correctly.
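Why concatenation forces a re-sort can be seen in a small user-space model. This is a sketch, not the kernel implementation: struct id_set, SET_MAX_CNT, id_set_merge(), and id_set_contains() are hypothetical names, with the limit of 32 borrowed from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SET_MAX_CNT 32 /* limit taken from the commit message */

struct id_set {
	uint32_t cnt;
	uint32_t ids[SET_MAX_CNT];
};

/* Three-way comparison of two IDs, usable by both qsort and bsearch. */
static int cmp_id(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/* Append the IDs of add to dst, then re-sort so lookups stay valid.
 * Returns -1 when the combined set would exceed the limit. */
static int id_set_merge(struct id_set *dst, const struct id_set *add)
{
	assert(dst && add);
	if (dst->cnt + add->cnt > SET_MAX_CNT)
		return -1;
	memcpy(dst->ids + dst->cnt, add->ids, add->cnt * sizeof(add->ids[0]));
	dst->cnt += add->cnt;
	qsort(dst->ids, dst->cnt, sizeof(dst->ids[0]), cmp_id);
	return 0;
}

/* Binary search; only correct because the set is kept sorted. */
static bool id_set_contains(const struct id_set *set, uint32_t id)
{
	return bsearch(&id, set->ids, set->cnt, sizeof(set->ids[0]), cmp_id) != NULL;
}
```

Two individually sorted sets are generally not sorted once appended to each other, so the sort after each merge is what keeps the binary search valid.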
      
      Next commit will update the kernel users to make use of this
      infrastructure.
      
      Finally, add __maybe_unused annotation for BTF ID macros for the
      !CONFIG_DEBUG_INFO_BTF case, so that they don't produce warnings during
      build time.
      
      The previous patch is also needed to provide synchronization against
      initialization for module BTF's kfunc_set_tab introduced here, as
      described below:
      
        The kfunc_set_tab pointer in struct btf is write-once (if we consider
        the registration phase (comprised of multiple register_btf_kfunc_id_set
        calls) as a single operation). In this sense, once it has been fully
        prepared, it isn't modified, only used for lookup (from the verifier
        context).
      
        For btf_vmlinux, it is initialized fully during the do_initcalls phase,
        which happens fairly early in the boot process, before any processes are
        present. This also eliminates the possibility of bpf_check being called
        at that point, thus relieving us of ensuring any synchronization between
        the registration and lookup function (btf_kfunc_id_set_contains).
      
        However, the case for module BTF is a bit tricky. The BTF is parsed,
        prepared, and published from the MODULE_STATE_COMING notifier callback.
        After this, the module initcalls are invoked, where our registration
        function will be called to populate the kfunc_set_tab for module BTF.
      
        At this point, BTF may be available to userspace while its corresponding
        module is still initializing. A BTF fd can then be passed to the verifier
        using the bpf syscall (e.g. for a kfunc call insn).
      
        Hence, there is a race window where the verifier may concurrently try
        to look up the kfunc_set_tab. To prevent this race, we must ensure the
        operations are serialized, or wait for the __init functions to
        complete.
      
        In the earlier registration API, this race was alleviated as verifier
        bpf_check_mod_kfunc_call didn't find the kfunc BTF ID until it was added
        by the registration function (called usually at the end of module __init
        function after all module resources have been initialized). If the
        verifier made the check_kfunc_call before kfunc BTF ID was added to the
        list, it would fail verification (saying call isn't allowed). The
        access to list was protected using a mutex.
      
        Now, it would still fail verification, but for a different reason
        (returning ENXIO due to the failed btf_try_get_module call in
        add_kfunc_call), because if the __init call is in progress the module
        will be in the middle of MODULE_STATE_COMING -> MODULE_STATE_LIVE
        transition, and the BTF_MODULE_F_LIVE flag for btf_module instance will
        not be set, so the btf_try_get_module call will fail.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-3-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      dee872e1
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Fix UAF due to race between btf_try_get_module and load_module · 18688de2
      Kumar Kartikeya Dwivedi authored
      While working on code to populate kfunc BTF ID sets for module BTF from
      its initcall, I noticed that by the time the initcall is invoked, the
      module BTF can already be seen by userspace (and the BPF verifier). The
      existing btf_try_get_module calls try_module_get which only fails if
      mod->state == MODULE_STATE_GOING, i.e. it can increment module reference
      when module initcall is happening in parallel.
      
      Currently, BTF parsing happens from MODULE_STATE_COMING notifier
      callback. At this point, the module initcalls have not been invoked.
      The notifier callback parses and prepares the module BTF, allocates an
      ID, which publishes it to userspace, and then adds it to the btf_modules
      list allowing the kernel to invoke btf_try_get_module for the BTF.
      
      However, at this point, the module has not been fully initialized (i.e.
      its initcalls have not finished). The code in module.c can still fail
      and free the module, without caring for other users. However, nothing
      stops btf_try_get_module from succeeding between the state transition
      from MODULE_STATE_COMING to MODULE_STATE_LIVE.
      
      This leads to a use-after-free when a BPF program loads successfully
      during the state transition, load_module's do_init_module call then
      fails and frees the module, and closing the BPF program fd later calls
      module_put on the freed module. A future patch adds a test case to
      verify that we don't regress in this area.
      
      There are multiple points after prepare_coming_module (in load_module)
      where failure can occur and module loading can return error. We
      illustrate and test for the race using the last point where it can
      practically occur (in module __init function).
      
      An illustration of the race:
      
      CPU 0                           CPU 1
                                load_module
                                  notifier_call(MODULE_STATE_COMING)
                                    btf_parse_module
                                    btf_alloc_id      // Published to userspace
                                    list_add(&btf_mod->list, btf_modules)
                                  mod->init(...)
      ...                             ^
      bpf_check                       |
      check_pseudo_btf_id             |
        btf_try_get_module            |
          returns true                |  ...
      ...                             |  module __init in progress
      return prog_fd                  |  ...
      ...                             V
                                  if (ret < 0)
                                    free_module(mod)
                                  ...
      close(prog_fd)
       ...
       bpf_prog_free_deferred
        module_put(used_btf.mod) // use-after-free
      
      We fix this issue by setting a flag BTF_MODULE_F_LIVE, from the notifier
      callback when MODULE_STATE_LIVE state is reached for the module, so that
      we return NULL from btf_try_get_module for modules that are not fully
      formed. Since try_module_get already checks that module is not in
      MODULE_STATE_GOING state, and that is the only transition a live module
      can make before being removed from btf_modules list, this is enough to
      close the race and prevent the bug.
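
      The gating logic can be modeled in userspace C to make the state
      machine concrete. This is a simplified sketch, not the kernel code:
      the model_* names, the reduced struct, and the flag value are
      assumptions made for illustration. Only the BTF_MODULE_F_LIVE name,
      the GOING check performed by try_module_get, and the ENXIO result
      seen by the verifier come from the text.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel's module states. */
enum mod_state { MOD_COMING, MOD_LIVE, MOD_GOING };

/* Flag name from the commit message; the value is illustrative. */
#define BTF_MODULE_F_LIVE (1u << 0)

/* Hypothetical, reduced model of an entry on the btf_modules list. */
struct model_btf_module {
	enum mod_state state;
	unsigned int flags;
	int refcnt;
};

/* Models the module notifier: only the LIVE transition sets the flag. */
static void model_notify(struct model_btf_module *m, enum mod_state next)
{
	m->state = next;
	if (next == MOD_LIVE)
		m->flags |= BTF_MODULE_F_LIVE;
}

/* Models try_module_get(): refuses only a module that is going away. */
static bool model_try_module_get(struct model_btf_module *m)
{
	if (m->state == MOD_GOING)
		return false;
	m->refcnt++;
	return true;
}

/* Pre-fix behavior: a reference can be taken while __init is running. */
static bool model_btf_try_get_old(struct model_btf_module *m)
{
	return model_try_module_get(m);
}

/* Fixed behavior: modules that are not fully formed are rejected. */
static bool model_btf_try_get(struct model_btf_module *m)
{
	if (!(m->flags & BTF_MODULE_F_LIVE))
		return false;
	return model_try_module_get(m);
}

/* Models the verifier-side outcome when the module cannot be pinned. */
static int model_verifier_lookup(struct model_btf_module *m)
{
	return model_btf_try_get(m) ? 0 : -ENXIO;
}
```

      With the module in the COMING state, model_btf_try_get_old takes a
      reference that can outlive a failed init, while model_verifier_lookup
      returns -ENXIO and leaves the reference count untouched.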
      
      A later selftest patch crafts the race condition artificially to verify
      that it has been fixed, and that the verifier fails to load the program
      (with ENXIO).
      
      Lastly, a couple of comments:
      
       1. Even if this race didn't exist, it seems more appropriate to only
          access resources (ksyms and kfuncs) of a fully formed module which
          has been initialized completely.
      
       2. This patch was born out of need for synchronization against module
          initcall for the next patch, so it is needed for correctness even
          without the aforementioned race condition. The BTF resources
          initialized by module initcall are set up once and then only looked
          up, so just waiting until the initcall has finished ensures correct
          behavior.
      
      Fixes: 541c3bad ("bpf: Support BPF ksym variables in kernel modules")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220114163953.1455836-2-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      18688de2