1. 15 Dec, 2023 9 commits
    • bpf: support symbolic BPF FS delegation mount options · c5707b21
      Andrii Nakryiko authored
      Besides the already supported special "any" value and hex bit mask, support
      string-based parsing of delegation masks based on exact enumerator
      names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
      `enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
      symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
      prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
      BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
      normalized to lower case in mount option output. So "PROG_LOAD",
      "prog_load", and "MAP_create" are all valid values to specify for
      delegate_cmds options, "array" is among supported for map types, etc.
      
      Besides supporting string values, we also support multiple values
      specified at the same time, using colon (':') separator.
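
      The parsing rules described above can be sketched in plain userspace C.
      This is only an illustration, not the kernel code: the kernel resolves
      names through BTF, while this sketch uses a small hand-written table
      (`cmd_names`, a hypothetical name) holding two `enum bpf_cmd` values.
      The function names are likewise made up, and matching "any" exactly is
      an assumption.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical lookup table standing in for the BTF enum lookup. */
struct name_bit {
	const char *name;	/* lower-case, prefix already stripped */
	unsigned int bit;	/* enumerator value, used as bit position */
};

static const struct name_bit cmd_names[] = {
	{ "map_create", 0 },	/* BPF_MAP_CREATE */
	{ "prog_load", 5 },	/* BPF_PROG_LOAD */
};

/* Case-insensitive compare of tok[0..len) against a lower-case name. */
static int name_eq(const char *tok, size_t len, const char *name)
{
	size_t i;

	if (strlen(name) != len)
		return 0;
	for (i = 0; i < len; i++)
		if (tolower((unsigned char)tok[i]) != name[i])
			return 0;
	return 1;
}

/* Parse one ':'-separated token into bits; returns 0 or -1. */
static int parse_token(const char *tok, size_t len, uint64_t *mask)
{
	char num[32];
	char *end;
	size_t i;

	/* Assumption: "any" is matched exactly and selects every bit. */
	if (len == 3 && memcmp(tok, "any", 3) == 0) {
		*mask = ~(uint64_t)0;
		return 0;
	}
	for (i = 0; i < sizeof(cmd_names) / sizeof(cmd_names[0]); i++) {
		if (name_eq(tok, len, cmd_names[i].name)) {
			*mask |= (uint64_t)1 << cmd_names[i].bit;
			return 0;
		}
	}
	/* Not a known name: try a numeric mask such as 0x21. */
	if (len == 0 || len >= sizeof(num))
		return -1;
	memcpy(num, tok, len);
	num[len] = '\0';
	*mask |= strtoull(num, &end, 0);
	return *end == '\0' ? 0 : -1;
}

/* Parse a whole option value, e.g. "prog_load:MAP_CREATE". */
static int parse_delegate_mask(const char *opt, uint64_t *mask)
{
	uint64_t m = 0;

	for (;;) {
		size_t len = strcspn(opt, ":");

		if (parse_token(opt, len, &m))
			return -1;
		if (opt[len] == '\0')
			break;
		opt += len + 1;
	}
	*mask = m;	/* only written on success */
	return 0;
}
```

      With this table, "prog_load:MAP_CREATE" and "0x21" yield the same mask,
      and an unknown name is rejected.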
      
      There are corresponding changes on the bpf_show_options side to use known
      values to print them in human-readable format, falling back to hex mask
      printing if there are any unrecognized bits. This shouldn't be
      necessary when enum BTF information is present, but in general we should
      always be able to fall back to this even if the kernel was built without BTF.
      As mentioned, emitted symbolic names are normalized to be all lower case.
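
      The printing side can be sketched the same way. This is a userspace
      illustration under stated assumptions, not the kernel's
      bpf_show_options code: `known_cmds` and `format_delegate_mask` are
      hypothetical names, and appending leftover bits as one hex token is one
      plausible reading of the fallback described above.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table of names the printer recognizes, in enum order. */
struct cmd_name {
	const char *name;
	unsigned int bit;
};

static const struct cmd_name known_cmds[] = {
	{ "map_create", 0 },	/* BPF_MAP_CREATE */
	{ "prog_load", 5 },	/* BPF_PROG_LOAD */
};

/*
 * Write a ':'-separated list of lower-case names for the bits set in
 * mask. Bits without a known name are emitted as a single hex token
 * (assumption; see the lead-in).
 */
static void format_delegate_mask(uint64_t mask, char *out, size_t out_sz)
{
	size_t i, used = 0;
	int w;

	if (out_sz == 0)
		return;
	out[0] = '\0';

	for (i = 0; i < sizeof(known_cmds) / sizeof(known_cmds[0]); i++) {
		uint64_t bit = (uint64_t)1 << known_cmds[i].bit;

		if (!(mask & bit))
			continue;
		w = snprintf(out + used, out_sz - used, "%s%s",
			     used ? ":" : "", known_cmds[i].name);
		if (w < 0 || (size_t)w >= out_sz - used)
			return;	/* output truncated, stop here */
		used += (size_t)w;
		mask &= ~bit;	/* this bit has been printed */
	}
	if (mask)	/* unrecognized bits remain */
		snprintf(out + used, out_sz - used, "%s0x%llx",
			 used ? ":" : "", (unsigned long long)mask);
}
```

      Because the table is walked in enumerator order, a mask built from
      "prog_load:MAP_CREATE" prints back as "map_create:prog_load", matching
      the ordering seen in the mount output of this commit message.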
      
      Example below shows various ways to specify delegate_cmds options
      through mount command and how mount options are printed back:
      
      12/14 14:39:07.604
      vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
      $ mount | rg token
      
        $ sudo mkdir -p /sys/fs/bpf/token
        $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
                     -o delegate_cmds=prog_load:MAP_CREATE \
                     -o delegate_progs=kprobe \
                     -o delegate_attachs=xdp
        $ mount | grep token
        bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'add-bpf_xdp_get_xfrm_state-kfunc' · 403f3e8f
      Alexei Starovoitov authored
      Daniel Xu says:
      
      ====================
      Add bpf_xdp_get_xfrm_state() kfunc
      
      This patchset adds two kfunc helpers, bpf_xdp_get_xfrm_state() and
      bpf_xdp_xfrm_state_release() that wrap xfrm_state_lookup() and
      xfrm_state_put(). The intent is to support software RSS (via XDP) for
      the ongoing/upcoming ipsec pcpu work [0]. Recent experiments performed
      on (hopefully) reproducible AWS testbeds indicate that single tunnel
      pcpu ipsec can reach line rate on 100G ENA nics.
      
      Note this patchset only tests/shows generic xfrm_state access. The
      "secret sauce" (if you can really even call it that) involves accessing
      a soon-to-be-upstreamed pcpu_num field in xfrm_state. Early example is
      available here [1].
      
      [0]: https://datatracker.ietf.org/doc/draft-ietf-ipsecme-multi-sa-performance/03/
      [1]: https://github.com/danobi/xdp-tools/blob/e89a1c617aba3b50d990f779357d6ce2863ecb27/xdp-bench/xdp_redirect_cpumap.bpf.c#L385-L406
      
      Changes from v5:
      * Improve kfunc doc comments
      * Remove extraneous replay-window setting on selftest reverse path
      * Squash two kfunc commits into one
      * Rebase to bpf-next to pick up bitfield write patches
      * Remove testing of opts.error in selftest prog
      
      Changes from v4:
      * Fixup commit message for selftest
      * Set opts->error -ENOENT for !x
      * Revert single file xfrm + bpf
      
      Changes from v3:
      * Place all xfrm bpf integrations in xfrm_bpf.c
      * Avoid using nval as a temporary
      * Rebase to bpf-next
      * Remove extraneous __failure_unpriv annotation for verifier tests
      
      Changes from v2:
      * Fix/simplify BPF_CORE_WRITE_BITFIELD() algorithm
      * Added verifier tests for bitfield writes
      * Fix state leakage across test_tunnel subtests
      
      Changes from v1:
      * Move xfrm tunnel tests to test_progs
      * Fix writing to opts->error when opts is invalid
      * Use __bpf_kfunc_start_defs()
      * Remove unused vxlanhdr definition
      * Add and use BPF_CORE_WRITE_BITFIELD() macro
      * Make series bisect clean
      
      Changes from RFCv2:
      * Rebased to ipsec-next
      * Fix netns leak
      
      Changes from RFCv1:
      * Add Antony's commit tags
      * Add KF_ACQUIRE and KF_RELEASE semantics
      ====================
      Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
      Link: https://lore.kernel.org/r/cover.1702593901.git.dxu@dxuuu.xyz
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() · 2cd07b0e
      Daniel Xu authored
      This commit extends test_tunnel selftest to test the new XDP xfrm state
      lookup kfunc.
      Co-developed-by: Antony Antony <antony.antony@secunet.com>
      Signed-off-by: Antony Antony <antony.antony@secunet.com>
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Link: https://lore.kernel.org/r/e704e9a4332e3eac7b458e4bfdec8fcc6984cdb6.1702593901.git.dxu@dxuuu.xyz
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: selftests: Move xfrm tunnel test to test_progs · e7adc829
      Daniel Xu authored
      test_progs is better than a shell script b/c C is a bit easier to
      maintain than shell. Also it's easier to use new infra like memory
      mapped global variables from C via bpf skeleton.
      Co-developed-by: Antony Antony <antony.antony@secunet.com>
      Signed-off-by: Antony Antony <antony.antony@secunet.com>
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Link: https://lore.kernel.org/r/a350db9e08520c64544562d88ec005a039124d9b.1702593901.git.dxu@dxuuu.xyz
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: selftests: test_tunnel: Use vmlinux.h declarations · 02b4e126
      Daniel Xu authored
      vmlinux.h declarations are more ergonomic, especially when working with
      kfuncs. The uapi headers are often incomplete for kfunc definitions.
      
      This commit also switches bitfield accesses to use CO-RE helpers.
      Switching to vmlinux.h definitions makes the verifier very
      unhappy with raw bitfield accesses. The error is:
      
          ; md.u.md2.dir = direction;
          33: (69) r1 = *(u16 *)(r2 +11)
          misaligned stack access off (0x0; 0x0)+-64+11 size 2
      
      Fix by using CO-RE-aware bitfield reads and writes.
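
      To illustrate why a helper-based access avoids the misaligned load, here
      is a minimal userspace sketch of a bitfield read-modify-write that only
      touches the bytes of the field's storage unit. It is not the
      implementation of BPF_CORE_READ_BITFIELD() or BPF_CORE_WRITE_BITFIELD():
      `struct bitfield_desc` and the function names are hypothetical, and the
      little-endian bit layout is an assumption of this sketch, whereas CO-RE
      derives the equivalent facts from relocations.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of where a bitfield lives. */
struct bitfield_desc {
	size_t byte_off;	/* offset of the storage unit in the struct */
	size_t byte_sz;		/* size of the storage unit: 1, 2, 4 or 8 */
	unsigned int bit_off;	/* lowest bit of the field within the unit */
	unsigned int bit_sz;	/* field width in bits */
};

/* Mask with the low bit_sz bits set, avoiding an undefined 64-bit shift. */
static uint64_t field_mask(unsigned int bit_sz)
{
	return bit_sz >= 64 ? ~(uint64_t)0 : (((uint64_t)1 << bit_sz) - 1);
}

/* Assemble the unit byte by byte (little-endian layout assumed). */
static uint64_t load_unit(const unsigned char *p, size_t sz)
{
	uint64_t unit = 0;
	size_t i;

	for (i = 0; i < sz; i++)
		unit |= (uint64_t)p[i] << (8 * i);
	return unit;
}

static void store_unit(unsigned char *p, size_t sz, uint64_t unit)
{
	size_t i;

	for (i = 0; i < sz; i++)
		p[i] = (unsigned char)(unit >> (8 * i));
}

static uint64_t bitfield_read(const void *base, struct bitfield_desc d)
{
	const unsigned char *p = (const unsigned char *)base + d.byte_off;

	return (load_unit(p, d.byte_sz) >> d.bit_off) & field_mask(d.bit_sz);
}

static void bitfield_write(void *base, struct bitfield_desc d, uint64_t val)
{
	unsigned char *p = (unsigned char *)base + d.byte_off;
	uint64_t mask = field_mask(d.bit_sz) << d.bit_off;
	uint64_t unit = load_unit(p, d.byte_sz);

	/* Clear the field, then merge in the new value. */
	unit = (unit & ~mask) | ((val << d.bit_off) & mask);
	store_unit(p, d.byte_sz, unit);
}
```

      Because every access is done one byte at a time, no load or store is
      wider than the alignment of its address allows, and bits that belong to
      neighbouring fields are preserved.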
      Co-developed-by: Antony Antony <antony.antony@secunet.com>
      Signed-off-by: Antony Antony <antony.antony@secunet.com>
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Link: https://lore.kernel.org/r/884bde1d9a351d126a3923886b945ea6b1b0776b.1702593901.git.dxu@dxuuu.xyz
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: selftests: test_tunnel: Setup fresh topology for each subtest · 77a7a822
      Daniel Xu authored
      This helps with determinism b/c individual setup/teardown prevents
      leaking state between different subtests.
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Link: https://lore.kernel.org/r/0fb59fa16fb58cca7def5239df606005a3e8dd0e.1702593901.git.dxu@dxuuu.xyz
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc · 8f0ec8c6
      Daniel Xu authored
      This commit adds an unstable kfunc helper to access internal xfrm_state
      associated with an SA. This is intended to be used for the upcoming
      IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
      words: for custom software RSS.
      
      That being said, the function that this kfunc wraps is fairly generic
      and used for a lot of xfrm tasks. I'm sure people will find uses
      elsewhere over time.
      
      This commit also adds a corresponding bpf_xdp_xfrm_state_release() kfunc
      to release the refcnt acquired by bpf_xdp_get_xfrm_state(). The verifier
      will require that all acquired xfrm_state's are released.
      Co-developed-by: Antony Antony <antony.antony@secunet.com>
      Signed-off-by: Antony Antony <antony.antony@secunet.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Link: https://lore.kernel.org/r/a29699c42f5fad456b875c98dd11c6afc3ffb707.1702593901.git.dxu@dxuuu.xyz
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Remove flaky test_btf_id test · 56925f38
      Yonghong Song authored
      With the previous patch, one of the subtests in test_btf_id becomes
      flaky and may fail. The following is a failing example:
      
        Error: #26 btf
        Error: #26/174 btf/BTF ID
          Error: #26/174 btf/BTF ID
          btf_raw_create:PASS:check 0 nsec
          btf_raw_create:PASS:check 0 nsec
          test_btf_id:PASS:check 0 nsec
          ...
          test_btf_id:PASS:check 0 nsec
          test_btf_id:FAIL:check BTF lingersdo_test_get_info:FAIL:check failed: -1
      
      The test tries to prove that a btf_id is not available after the map is
      closed. But the btf_id is now freed only after a workqueue run and an RCU
      grace period, whereas previously it was freed just after an RCU grace period.
      Depending on system workload, the workqueue could take quite some time
      to execute bpf_map_free_deferred(), which may cause the test to fail.
      Instead of adding arbitrary delays, let us remove the logic that
      checks btf_id availability after the map is closed.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20231214203820.1469402-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix a race condition between btf_put() and map_free() · 59e5791f
      Yonghong Song authored
      When running `./test_progs -j` in my local vm with latest kernel,
      I once hit a kasan error like below:
      
        [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
        [ 1887.186498]
        [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706-dirty #494
        [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
        [ 1887.190341] Call Trace:
        [ 1887.190666]  <TASK>
        [ 1887.190949]  dump_stack_lvl+0xac/0xe0
        [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
        [ 1887.192019]  ? panic+0x3c0/0x3c0
        [ 1887.192449]  print_report+0x14f/0x720
        [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
        [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.194572]  kasan_report+0xc3/0x100
        [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
        [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
        [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
        [ 1887.198883]  array_map_free+0x1a3/0x260
        [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
        [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
        [ 1887.200549]  worker_thread+0x633/0x890
        [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
        [ 1887.201574]  ? kthread+0x102/0x1d0
        [ 1887.202020]  kthread+0x1ab/0x1d0
        [ 1887.202447]  ? pr_cont_work+0x270/0x270
        [ 1887.202954]  ? kthread_blkcg+0x50/0x50
        [ 1887.203444]  ret_from_fork+0x34/0x50
        [ 1887.203914]  ? kthread_blkcg+0x50/0x50
        [ 1887.204397]  ret_from_fork_asm+0x11/0x20
        [ 1887.204913]  </TASK>
        [ 1887.205209]
        [ 1887.205416] Allocated by task 2197:
        [ 1887.205881]  kasan_set_track+0x3f/0x60
        [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
        [ 1887.206856]  __kmalloc+0xac/0x1a0
        [ 1887.207293]  btf_parse_fields+0xa15/0x1480
        [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
        [ 1887.208387]  btf_new_fd+0x294/0x4d0
        [ 1887.208851]  __sys_bpf+0x4ba/0x600
        [ 1887.209292]  __x64_sys_bpf+0x41/0x50
        [ 1887.209762]  do_syscall_64+0x4c/0xf0
        [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
        [ 1887.210868]
        [ 1887.211074] Freed by task 36:
        [ 1887.211460]  kasan_set_track+0x3f/0x60
        [ 1887.211951]  kasan_save_free_info+0x28/0x40
        [ 1887.212485]  ____kasan_slab_free+0x101/0x180
        [ 1887.213027]  __kmem_cache_free+0xe4/0x210
        [ 1887.213514]  btf_free+0x5b/0x130
        [ 1887.213918]  rcu_core+0x638/0xcc0
        [ 1887.214347]  __do_softirq+0x114/0x37e
      
      The error happens at bpf_rb_root_free+0x1f8/0x2b0:
      
        00000000000034c0 <bpf_rb_root_free>:
        ; {
          34c0: f3 0f 1e fa                   endbr64
          34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
          34c9: 55                            pushq   %rbp
          34ca: 48 89 e5                      movq    %rsp, %rbp
        ...
        ;       if (rec && rec->refcount_off >= 0 &&
          36aa: 4d 85 ed                      testq   %r13, %r13
          36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
          36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
          36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                              <==== kasan function
          36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                              <==== use-after-free load
          36bc: 45 85 ff                      testl   %r15d, %r15d
          36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>
      
      So the problem is at rec->refcount_off in the above.
      
      I did some source code analysis and found the reason:
                                        CPU A                        CPU B
        bpf_map_put:
          ...
          btf_put with rcu callback
          ...
          bpf_map_free_deferred
            with system_unbound_wq
          ...                          ...                           ...
          ...                          btf_free_rcu:                 ...
          ...                          ...                           bpf_map_free_deferred:
          ...                          ...
          ...         --------->       btf_struct_metas_free()
          ...         | race condition ...
          ...         --------->                                     map->ops->map_free()
          ...
          ...                          btf->struct_meta_tab = NULL
      
      In the above, map_free() corresponds to array_map_free(), which eventually
      calls bpf_rb_root_free(), which in turn calls:
        ...
        __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
        ...
      
      Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with the following code:
      
        meta = btf_find_struct_meta(btf, btf_id);
        if (!meta)
          return -EFAULT;
        rec->fields[i].graph_root.value_rec = meta->record;
      
      So basically, 'value_rec' is a pointer to the record in struct_meta_tab.
      It is possible that that particular record has already been freed by
      btf_struct_metas_free(), hence the kasan error here.
      
      It is actually very hard to reproduce the failure with the current
      bpf/bpf-next code; I only got the above error once. To make it easier to
      reproduce, I added a delay in bpf_map_free_deferred() before
      map->ops->map_free(), which significantly increased reproducibility.
      
        diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
        index 5e43ddd1b83f..aae5b5213e93 100644
        --- a/kernel/bpf/syscall.c
        +++ b/kernel/bpf/syscall.c
        @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
              struct bpf_map *map = container_of(work, struct bpf_map, work);
              struct btf_record *rec = map->record;
      
        +     mdelay(100);
              security_bpf_map_free(map);
              bpf_map_release_memcg(map);
              /* implementation dependent freeing */
      
      Hou also provided test cases ([1]) for easily reproducing the above issue.
      
      There are two ways to fix the issue: the v1 of the patch ([2]) moves
      btf_put() after the map_free callback, and the v5 of the patch ([3]) uses
      a kptr-style fix which tries to get a btf reference during
      map_check_btf(). Each approach has its pros and cons. The first approach
      delays freeing btf, while the second approach needs to acquire a reference
      depending on context, which makes the logic not very elegant and may
      complicate things with future new data structures. Alexei
      suggested in [4] going back to v1, which is what this patch
      tries to do.
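
      The ordering that the v1 approach relies on can be modelled in a few
      lines of userspace C. This is a toy sketch, not the kernel patch: all
      `toy_*` names are hypothetical, and RCU and the workqueue are left out.
      It only shows that the map's borrowed record pointer stays valid as long
      as the btf reference is dropped after map_free has run.

```c
#include <stdlib.h>

/* Toy stand-in for struct btf: owns the record that maps point into. */
struct toy_btf {
	int refcnt;
	int *record;
};

/* Toy stand-in for struct bpf_map: value_rec is borrowed from the btf. */
struct toy_map {
	struct toy_btf *btf;
	int *value_rec;
};

static int toy_btf_freed;	/* set once the btf and its record are gone */

static void toy_btf_put(struct toy_btf *btf)
{
	if (--btf->refcnt == 0) {
		free(btf->record);
		free(btf);
		toy_btf_freed = 1;
	}
}

static struct toy_map *toy_map_create(int rec_val)
{
	struct toy_btf *btf = malloc(sizeof(*btf));
	struct toy_map *map = malloc(sizeof(*map));
	int *rec = malloc(sizeof(*rec));

	if (!btf || !map || !rec) {
		free(btf);
		free(map);
		free(rec);
		return NULL;
	}
	*rec = rec_val;
	btf->refcnt = 1;	/* the reference held on behalf of the map */
	btf->record = rec;
	map->btf = btf;
	map->value_rec = rec;
	return map;
}

/* Models map->ops->map_free(): it dereferences the borrowed record. */
static int toy_map_free(struct toy_map *map)
{
	return *map->value_rec;
}

/* Models the fixed ordering: run map_free first, drop the btf ref last. */
static int toy_map_free_deferred(struct toy_map *map)
{
	struct toy_btf *btf = map->btf;	/* save before the map goes away */
	int seen = toy_map_free(map);	/* record is still alive here */

	free(map);
	toy_btf_put(btf);		/* now the record may be freed */
	return seen;
}
```

      Swapping the last two steps so that toy_btf_put() runs before
      toy_map_free() would reproduce the use-after-free pattern from the
      diagram above.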
      
      I reran './test_progs -j' with the above mdelay() hack a couple
      of times and didn't observe the error for the above rb_root test cases.
      Running Hou's test ([1]) is also successful.
      
        [1] https://lore.kernel.org/bpf/20231207141500.917136-1-houtao@huaweicloud.com/
        [2] v1: https://lore.kernel.org/bpf/20231204173946.3066377-1-yonghong.song@linux.dev/
        [3] v5: https://lore.kernel.org/bpf/20231208041621.2968241-1-yonghong.song@linux.dev/
        [4] v4: https://lore.kernel.org/bpf/CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com/
      
      Cc: Hou Tao <houtao@huaweicloud.com>
      Fixes: 958cf2e2 ("bpf: Introduce bpf_obj_new")
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20231214203815.1469107-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 14 Dec, 2023 26 commits
  3. 13 Dec, 2023 5 commits