1. 17 Mar, 2023 11 commits
    • selftests/bpf: Add --json-summary option to test_progs · 2be7aa76
      Manu Bretelle authored
      Currently, test_progs outputs all stdout/stderr as it runs, and when it
      is done, prints a summary.
      
      It is non-trivial for tooling to parse that output and extract meaningful
      information from it.
      
      This change adds a new option, `--json-summary`/`-J`, that lets the caller
      specify a file where `test_progs{,-no_alu32}` can write a summary of the
      run in JSON format that can later be parsed by tooling.
      
      Currently, it creates a summary section with successes/skipped/failures
      followed by a list of failed tests and subtests.
      
      A test contains the following fields:
      - name: the name of the test
      - number: the number of the test
      - message: the log message that was printed by the test.
      - failed: A boolean indicating whether the test failed or not. Currently
      we only output failed tests, but in the future, successful tests could
      be added.
      - subtests: A list of subtests associated with this test.
      
      A subtest contains the following fields:
      - name: same as above
      - number: same as above
      - message: the log message that was printed by the subtest.
      - failed: same as above but for the subtest
      
      An example run and json content below:
      ```
      $ sudo ./test_progs -a $(grep -v '^#' ./DENYLIST.aarch64 | awk '{print
      $1","}' | tr -d '\n') -j -J /tmp/test_progs.json
      $ jq < /tmp/test_progs.json | head -n 30
      {
        "success": 29,
        "success_subtest": 23,
        "skipped": 3,
        "failed": 28,
        "results": [
          {
            "name": "bpf_cookie",
            "number": 10,
            "message": "test_bpf_cookie:PASS:skel_open 0 nsec\n",
            "failed": true,
            "subtests": [
              {
                "name": "multi_kprobe_link_api",
                "number": 2,
                "message": "kprobe_multi_link_api_subtest:PASS:load_kallsyms 0 nsec\nlibbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
                "failed": true
              },
              {
                "name": "multi_kprobe_attach_api",
                "number": 3,
                "message": "libbpf: extern 'bpf_testmod_fentry_test1' (strong): not resolved\nlibbpf: failed to load object 'kprobe_multi'\nlibbpf: failed to load BPF skeleton 'kprobe_multi': -3\nkprobe_multi_attach_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3\n",
                "failed": true
              },
              {
                "name": "lsm",
                "number": 8,
                "message": "lsm_subtest:PASS:lsm.link_create 0 nsec\nlsm_subtest:FAIL:stack_mprotect unexpected stack_mprotect: actual 0 != expected -1\n",
                "failed": true
              }
      ```
      
      The file can then be used to print a summary of the test run and list of
      failing tests/subtests:
      
      ```
      $ jq -r < /tmp/test_progs.json '"Success: \(.success)/\(.success_subtest), Skipped: \(.skipped), Failed: \(.failed)"'
      
      Success: 29/23, Skipped: 3, Failed: 28
      $ jq -r < /tmp/test_progs.json '.results | map([
          if .failed then "#\(.number) \(.name)" else empty end,
          (
              . as {name: $tname, number: $tnum} | .subtests | map(
                  if .failed then "#\($tnum)/\(.number) \($tname)/\(.name)" else empty end
              )
          )
      ]) | flatten | .[]' | head -n 20
       #10 bpf_cookie
       #10/2 bpf_cookie/multi_kprobe_link_api
       #10/3 bpf_cookie/multi_kprobe_attach_api
       #10/8 bpf_cookie/lsm
       #15 bpf_mod_race
       #15/1 bpf_mod_race/ksym (used_btfs UAF)
       #15/2 bpf_mod_race/kfunc (kfunc_btf_tab UAF)
       #36 cgroup_hierarchical_stats
       #61 deny_namespace
       #61/1 deny_namespace/unpriv_userns_create_no_bpf
       #73 fexit_stress
       #83 get_func_ip_test
       #99 kfunc_dynptr_param
       #99/1 kfunc_dynptr_param/dynptr_data_null
       #99/4 kfunc_dynptr_param/dynptr_data_null
       #100 kprobe_multi_bench_attach
       #100/1 kprobe_multi_bench_attach/kernel
       #100/2 kprobe_multi_bench_attach/modules
       #101 kprobe_multi_test
       #101/1 kprobe_multi_test/skel_api
      ```
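      The field layout and the jq listing above can also be mirrored in C. This
      is a minimal sketch only: the struct and function names are hypothetical
      (they are not taken from test_progs), and it assumes the JSON has already
      been parsed into memory by some other means.

      ```c
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Hypothetical in-memory mirror of a subtest entry (name, number, failed). */
      struct subtest_result {
          const char *name;
          int number;
          bool failed;
      };

      /* Hypothetical in-memory mirror of a test entry and its subtests. */
      struct test_result {
          const char *name;
          int number;
          bool failed;
          const struct subtest_result *subtests;
          size_t nr_subtests;
      };

      /* Count failed tests plus failed subtests. */
      size_t count_failures(const struct test_result *tests, size_t nr_tests)
      {
          size_t n = 0;

          for (size_t i = 0; i < nr_tests; i++) {
              if (tests[i].failed)
                  n++;
              for (size_t j = 0; j < tests[i].nr_subtests; j++)
                  if (tests[i].subtests[j].failed)
                      n++;
          }
          return n;
      }

      /*
       * Format one line in the same shape as the jq output:
       * "#N name" for a test (s == NULL), "#N/M name/subname" for a subtest.
       */
      int format_failure(char *buf, size_t len, const struct test_result *t,
                         const struct subtest_result *s)
      {
          if (!s)
              return snprintf(buf, len, "#%d %s", t->number, t->name);
          return snprintf(buf, len, "#%d/%d %s/%s", t->number, s->number,
                          t->name, s->name);
      }
      ```

      A caller would walk the parsed results, call format_failure() for every
      entry whose `failed` flag is set, and print the resulting lines.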
      Signed-off-by: Manu Bretelle <chantr4@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230317163256.3809328-1-chantr4@gmail.com
    • Merge branch 'bpf: Add detection of kfuncs.' · 6cae5a71
      Andrii Nakryiko authored
      Alexei Starovoitov says:
      
      ====================
      
      From: Alexei Starovoitov <ast@kernel.org>
      
      Allow BPF programs to detect at load time whether a particular kfunc exists.
      
      Patch 1: Allow ld_imm64 to point to kfunc in the kernel.
      Patch 2: Fix relocation of kfunc in ld_imm64 insn when kfunc is in kernel module.
      Patch 3: Introduce bpf_ksym_exists() macro.
      Patch 4: selftest.
      
      NOTE: detection of kfuncs from light skeleton is not supported yet.
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • selftests/bpf: Add test for bpf_ksym_exists(). · 95fdf6e3
      Alexei Starovoitov authored
      Add load and run time test for bpf_ksym_exists() and check that the verifier
      performs dead code elimination for non-existing kfunc.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20230317201920.62030-5-alexei.starovoitov@gmail.com
    • libbpf: Introduce bpf_ksym_exists() macro. · 5cbd3fe3
      Alexei Starovoitov authored
      Introduce bpf_ksym_exists() macro that can be used by BPF programs
      to detect at load time whether a particular ksym (either variable or kfunc)
      is present in the kernel.
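      The idea of testing a symbol's address for presence can be illustrated
      with a userspace analogy. This is a sketch under stated assumptions, not
      the libbpf macro itself: it relies on GCC/Clang weak symbols on an ELF
      target such as Linux, and sym_exists(), present_helper(), missing_helper()
      and call_if_present() are hypothetical names made up for the example.

      ```c
      #include <stdio.h>

      /* Weak declarations: an unresolved weak symbol gets address 0 at link time. */
      extern int present_helper(int x) __attribute__((weak));
      extern int missing_helper(int x) __attribute__((weak));

      /* Only present_helper is actually defined; missing_helper never is. */
      int present_helper(int x)
      {
          return x + 1;
      }

      /* Hypothetical userspace stand-in: "does this symbol resolve to non-NULL?" */
      #define sym_exists(sym) (!!(sym))

      /* Prefer missing_helper when it exists, otherwise fall back. */
      int call_if_present(int x)
      {
          if (sym_exists(missing_helper))
              return missing_helper(x); /* skipped: symbol is unresolved */
          if (sym_exists(present_helper))
              return present_helper(x);
          return -1;
      }
      ```

      In the BPF case the same kind of check is resolved at load time, so the
      branch guarding an absent symbol is never taken.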
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230317201920.62030-4-alexei.starovoitov@gmail.com
    • libbpf: Fix relocation of kfunc ksym in ld_imm64 insn. · 5fc13ad5
      Alexei Starovoitov authored
      void *p = kfunc; -> generates ld_imm64 insn.
      kfunc() -> generates bpf_call insn.
      
      libbpf patches the bpf_call insn correctly, but in the ld_imm64 case it sets
      only the btf_id part. As a result, pointers to kfuncs in modules are not
      patched correctly and the verifier rejects such programs at load time because
      btf_id is out of range. Fix libbpf to patch ld_imm64 for kfunc.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230317201920.62030-3-alexei.starovoitov@gmail.com
    • bpf: Allow ld_imm64 instruction to point to kfunc. · 58aa2afb
      Alexei Starovoitov authored
      Allow ld_imm64 insn with BPF_PSEUDO_BTF_ID to hold the address of a kfunc. An
      ld_imm64 pointing to a valid kfunc will be seen as non-null PTR_TO_MEM by the
      is_branch_taken() logic of the verifier, while libbpf will resolve the address
      of an unknown kfunc as ld_imm64 reg, 0, which will also be recognized by
      is_branch_taken(), and the verifier will proceed with dead code elimination. BPF
      programs can use this logic to detect at load time whether a kfunc is present in
      the kernel with the bpf_ksym_exists() macro that is introduced in the next patches.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20230317201920.62030-2-alexei.starovoitov@gmail.com
    • bpf, docs: Use internal linking for link to netdev subsystem doc · 0f10f647
      Bagas Sanjaya authored
      Commit d56b0c46 ("bpf, docs: Fix link to netdev-FAQ target")
      attempts to fix the linking problem with the undefined "netdev-FAQ" label
      introduced in 287f4fa9 ("docs: Update references to netdev-FAQ")
      by changing the internal cross reference to the netdev subsystem documentation
      (Documentation/process/maintainer-netdev.rst) to an external one at
      docs.kernel.org. However, the linking problem is still not
      resolved, as the generated link points to a non-existent netdev-FAQ
      section of the external doc, which, when clicked, instead goes
      to the top of the doc.
      
      Revert to internal linking by simply mentioning the doc path while
      massaging the text leading up to the link, since the netdev subsystem
      doc contains no FAQs but rather general information about the subsystem.
      
      Fixes: d56b0c46 ("bpf, docs: Fix link to netdev-FAQ target")
      Fixes: 287f4fa9 ("docs: Update references to netdev-FAQ")
      Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230314074449.23620-1-bagasdotme@gmail.com
    • kallsyms, bpf: Move find_kallsyms_symbol_value out of internal header · bd5314f8
      Viktor Malik authored
      Move find_kallsyms_symbol_value from kernel/module/internal.h to
      include/linux/module.h. The reason is that internal.h is not prepared to
      be included when CONFIG_MODULES=n. find_kallsyms_symbol_value is used by
      kernel/bpf/verifier.c, and including internal.h from it (without modules)
      leads to a compilation error:
      
        In file included from ../include/linux/container_of.h:5,
                         from ../include/linux/list.h:5,
                         from ../include/linux/timer.h:5,
                         from ../include/linux/workqueue.h:9,
                         from ../include/linux/bpf.h:10,
                         from ../include/linux/bpf-cgroup.h:5,
                         from ../kernel/bpf/verifier.c:7:
        ../kernel/bpf/../module/internal.h: In function 'mod_find':
        ../include/linux/container_of.h:20:54: error: invalid use of undefined type 'struct module'
           20 |         static_assert(__same_type(*(ptr), ((type *)0)->member) ||       \
              |                                                      ^~
        [...]
      
      This patch fixes the above error.
      
      Fixes: 31bf1dbc ("bpf: Fix attaching fentry/fexit/fmod_ret/lsm to modules")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Viktor Malik <vmalik@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/oe-kbuild-all/202303161404.OrmfCy09-lkp@intel.com/
      Link: https://lore.kernel.org/bpf/20230317095601.386738-1-vmalik@redhat.com
    • Merge branch 'double-fix bpf_test_run + XDP_PASS recycling' · 94bbbdfb
      Alexei Starovoitov authored
      Alexander Lobakin says:
      
      ====================
      
      Enabling skb PP recycling revealed a couple of issues in the bpf_test_run
      code. Recycling broke the assumption that the headroom won't ever be
      touched during the test_run execution: xdp_scrub_frame() invalidates the
      XDP frame at the headroom start, while neigh xmit code overwrites 2 bytes
      to the left of the Ethernet header. The first makes the kernel panic in
      certain cases, while the second breaks xdp_do_redirect selftest on BE.
      test_run is a limited-scope entity, so let's hope no more corner cases
      will happen here or at least they will be as easy and pleasant to fix
      as those two.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: fix "metadata marker" getting overwritten by the netstack · 5640b6d8
      Alexander Lobakin authored
      Alexei noticed that the xdp_do_redirect test on BPF CI started failing on
      BE systems after skb PP recycling was enabled:
      
      test_xdp_do_redirect:PASS:prog_run 0 nsec
      test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec
      test_xdp_do_redirect:PASS:pkt_count_zero 0 nsec
      test_xdp_do_redirect:FAIL:pkt_count_tc unexpected pkt_count_tc: actual
      220 != expected 9998
      test_max_pkt_size:PASS:prog_run_max_size 0 nsec
      test_max_pkt_size:PASS:prog_run_too_big 0 nsec
      close_netns:PASS:setns 0 nsec
       #289 xdp_do_redirect:FAIL
      Summary: 270/1674 PASSED, 30 SKIPPED, 1 FAILED
      
      and it doesn't happen on LE systems.
      Ilya then hunted it down to:
      
       #0  0x0000000000aaeee6 in neigh_hh_output (hh=0x83258df0,
      skb=0x88142200) at linux/include/net/neighbour.h:503
       #1  0x0000000000ab2cda in neigh_output (skip_cache=false,
      skb=0x88142200, n=<optimized out>) at linux/include/net/neighbour.h:544
       #2  ip6_finish_output2 (net=net@entry=0x88edba00, sk=sk@entry=0x0,
      skb=skb@entry=0x88142200) at linux/net/ipv6/ip6_output.c:134
       #3  0x0000000000ab4cbc in __ip6_finish_output (skb=0x88142200, sk=0x0,
      net=0x88edba00) at linux/net/ipv6/ip6_output.c:195
       #4  ip6_finish_output (net=0x88edba00, sk=0x0, skb=0x88142200) at
      linux/net/ipv6/ip6_output.c:206
      
      The xdp_do_redirect test places a u32 marker (0x42) right before the Ethernet
      header, then checks it in the XDP program and returns %XDP_ABORTED if it's
      not there. Neigh xmit code likes to round up hard header length to speed
      up copying the header, so it overwrites two bytes in front of the Eth
      header. On LE systems, 0x42 is one byte at `data - 4`, while on BE it's
      `data - 1`, which explains why it happens only there.
      It didn't happen previously because %XDP_PASS meant the page would be
      discarded and replaced by a new one, but now it can be recycled as well,
      while bpf_test_run code doesn't reinitialize the content of recycled
      pages. This mark is limited to this particular test and its setup though,
      so there's no need to predict 1000 different possible cases. Just move
      it 4 bytes to the left, still keeping it 32 bit to match on more bytes.
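      The byte-order effect can be reproduced in plain userspace C. The sketch
      below is illustrative only: marker_byte_offset() and marker_survives() are
      hypothetical helpers, the 8-byte headroom is made up, and it assumes the
      two bytes clobbered in front of the header are written as zeros (which is
      consistent with the LE behaviour described, but not stated in the commit).

      ```c
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      #define MARKER 0x42u

      /*
       * With a u32 marker stored in the 4 bytes right before `data`, return the
       * offset (relative to `data`) of the byte that actually holds 0x42:
       * -4 on little-endian, -1 on big-endian.
       */
      int marker_byte_offset(void)
      {
          uint8_t bytes[4];
          uint32_t marker = MARKER;

          memcpy(bytes, &marker, sizeof(marker));
          for (int i = 0; i < 4; i++)
              if (bytes[i] == 0x42)
                  return i - 4;
          return 0; /* not reached for MARKER == 0x42 */
      }

      /*
       * Store the marker at `data - marker_off` (marker_off must be 4..8),
       * zero the two bytes right before `data` (assumed clobber pattern),
       * and report whether the full 32-bit marker still reads back intact.
       */
      bool marker_survives(size_t marker_off)
      {
          uint8_t headroom[8] = { 0 };
          uint8_t *data = headroom + sizeof(headroom);
          uint32_t marker = MARKER, readback;

          memcpy(data - marker_off, &marker, sizeof(marker));
          memset(data - 2, 0, 2);
          memcpy(&readback, data - marker_off, sizeof(readback));
          return readback == MARKER;
      }
      ```

      Under these assumptions, marker_survives(4) holds only on little-endian
      hosts, while marker_survives(8), the marker moved 4 bytes to the left,
      holds regardless of byte order.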
      
      Fixes: 9c94bbf9 ("xdp: recycle Page Pool backed skbs built from XDP frames")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/CAADnVQ+B_JOU+EpP=DKhbY9yXdN6GiRPnpTTXfEZ9sNkUeb-yQ@mail.gmail.com
      Reported-by: Ilya Leoshkevich <iii@linux.ibm.com> # + debugging
      Link: https://lore.kernel.org/bpf/8341c1d9f935f410438e79d3bd8a9cc50aefe105.camel@linux.ibm.com
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/r/20230316175051.922550-3-aleksander.lobakin@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, test_run: fix crashes due to XDP frame overwriting/corruption · e5995bc7
      Alexander Lobakin authored
      syzbot and Ilya hit splats when %XDP_PASS happens for bpf_test_run
      after skb PP recycling was enabled for {__,}xdp_build_skb_from_frame():
      
      BUG: kernel NULL pointer dereference, address: 0000000000000d28
      RIP: 0010:memset_erms+0xd/0x20 arch/x86/lib/memset_64.S:66
      [...]
      Call Trace:
       <TASK>
       __finalize_skb_around net/core/skbuff.c:321 [inline]
       __build_skb_around+0x232/0x3a0 net/core/skbuff.c:379
       build_skb_around+0x32/0x290 net/core/skbuff.c:444
       __xdp_build_skb_from_frame+0x121/0x760 net/core/xdp.c:622
       xdp_recv_frames net/bpf/test_run.c:248 [inline]
       xdp_test_run_batch net/bpf/test_run.c:334 [inline]
       bpf_test_run_xdp_live+0x1289/0x1930 net/bpf/test_run.c:362
       bpf_prog_test_run_xdp+0xa05/0x14e0 net/bpf/test_run.c:1418
      [...]
      
      This happens because it calls xdp_scrub_frame(), which nullifies
      xdpf->data. bpf_test_run code doesn't reinit the frame when the XDP
      program doesn't adjust head or tail. Previously, %XDP_PASS meant the
      page would be released from the pool and returned to the MM layer, but
      now it does return to the Pool with the nullified xdpf->data, which
      doesn't get reinitialized then.
      So, in addition to checking whether the head and/or tail have been
      adjusted, also check for a potential XDP frame corruption. xdpf->data
      is 100% affected and also xdpf->flags is the field closest to the
      metadata / frame start. Checking for these two should be enough for
      non-extreme cases.
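      The combined check described above could look roughly like the sketch
      below. It is a userspace illustration under stated assumptions, not the
      code in net/bpf/test_run.c: every struct, field and function name here is
      hypothetical and only models "context changed" versus "frame corrupted".

      ```c
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical stand-in for the head/tail pointers seen by the program. */
      struct sketch_ctx {
          void *data;
          void *data_end;
      };

      /* Hypothetical stand-in for the two frame fields named in the commit. */
      struct sketch_frame {
          void *data;
          uint32_t flags;
      };

      /* Hypothetical per-page bookkeeping: original values plus current state. */
      struct sketch_page_head {
          struct sketch_ctx orig_ctx;
          struct sketch_ctx ctx;
          struct sketch_frame frame;
          uint32_t orig_flags;
      };

      /* Did the program adjust head and/or tail? */
      bool ctx_was_changed(const struct sketch_page_head *h)
      {
          return h->orig_ctx.data != h->ctx.data ||
                 h->orig_ctx.data_end != h->ctx.data_end;
      }

      /* Was the frame itself touched, e.g. data nullified or flags overwritten? */
      bool frame_was_changed(const struct sketch_page_head *h)
      {
          return h->frame.data != h->orig_ctx.data ||
                 h->frame.flags != h->orig_flags;
      }

      /* Reinitialize only when either check fires. */
      bool needs_reset(const struct sketch_page_head *h)
      {
          return ctx_was_changed(h) || frame_was_changed(h);
      }
      ```

      Keeping the frame check to two fields keeps the common "nothing changed"
      path cheap, at the cost of missing corruption that leaves both intact.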
      
      Fixes: 9c94bbf9 ("xdp: recycle Page Pool backed skbs built from XDP frames")
      Reported-by: syzbot+e1d1b65f7c32f2a86a9f@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/bpf/000000000000f1985705f6ef2243@google.com
      Reported-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/bpf/e07dd94022ad5731705891b9487cc9ed66328b94.camel@linux.ibm.com
      Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/r/20230316175051.922550-2-aleksander.lobakin@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 16 Mar, 2023 13 commits
  3. 14 Mar, 2023 14 commits
  4. 13 Mar, 2023 2 commits
    • bpf: Disable migration when freeing stashed local kptr using obj drop · 9e36a204
      Dave Marchevsky authored
      When a local kptr is stashed in a map and freed when the map goes away,
      currently an error like the below appears:
      
      [   39.195695] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u32:15/2875
      [   39.196549] caller is bpf_mem_free+0x56/0xc0
      [   39.196958] CPU: 15 PID: 2875 Comm: kworker/u32:15 Tainted: G           O       6.2.0-13016-g22df776a #4477
      [   39.197897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
      [   39.198949] Workqueue: events_unbound bpf_map_free_deferred
      [   39.199470] Call Trace:
      [   39.199703]  <TASK>
      [   39.199911]  dump_stack_lvl+0x60/0x70
      [   39.200267]  check_preemption_disabled+0xbf/0xe0
      [   39.200704]  bpf_mem_free+0x56/0xc0
      [   39.201032]  ? bpf_obj_new_impl+0xa0/0xa0
      [   39.201430]  bpf_obj_free_fields+0x1cd/0x200
      [   39.201838]  array_map_free+0xad/0x220
      [   39.202193]  ? finish_task_switch+0xe5/0x3c0
      [   39.202614]  bpf_map_free_deferred+0xea/0x210
      [   39.203006]  ? lockdep_hardirqs_on_prepare+0xe/0x220
      [   39.203460]  process_one_work+0x64f/0xbe0
      [   39.203822]  ? pwq_dec_nr_in_flight+0x110/0x110
      [   39.204264]  ? do_raw_spin_lock+0x107/0x1c0
      [   39.204662]  ? lockdep_hardirqs_on_prepare+0xe/0x220
      [   39.205107]  worker_thread+0x74/0x7a0
      [   39.205451]  ? process_one_work+0xbe0/0xbe0
      [   39.205818]  kthread+0x171/0x1a0
      [   39.206111]  ? kthread_complete_and_exit+0x20/0x20
      [   39.206552]  ret_from_fork+0x1f/0x30
      [   39.206886]  </TASK>
      
      This happens because the call to __bpf_obj_drop_impl I added in the patch
      adding support for stashing local kptrs doesn't disable migration. Prior
      to that patch, __bpf_obj_drop_impl logic only ran when called by a BPF
      program, whereas now it can be called from map free path, so it's
      necessary to explicitly disable migration.
      
      Also, refactor a bit to just call __bpf_obj_drop_impl directly instead
      of bothering w/ dtor union and setting pointer-to-obj_drop.
      
      Fixes: c8e18754 ("bpf: Support __kptr to local kptrs")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230313214641.3731908-1-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tasks: Extract rcu_users out of union · 22df776a
      David Vernet authored
      In commit 3fbd7ee2 ("tasks: Add a count of task RCU users"), a
      count on the number of RCU users was added to struct task_struct. This
      was done so as to enable the removal of task_rcu_dereference(), and
      allow tasks to be protected by RCU even after exiting and being removed
      from the runqueue. In that commit, the 'refcount_t rcu_users' field that
      keeps track of this refcount was put into a union co-located with
      'struct rcu_head rcu', so as to avoid taking up any extra space in
      task_struct. This was possible to do safely, because the field was only
      ever decremented by a static set of specific callers, and then never
      incremented again.
      
      While this restriction of there only being a small, static set of users
      of this field has worked fine, it prevents us from leveraging the field
      to use RCU to protect tasks in other contexts.
      
      During tracing, for example, it would be useful to be able to collect
      some tasks that performed a certain operation, put them in a map, and
      then periodically summarize who they are, which cgroup they're in, how
      much CPU time they've utilized, etc. While this can currently be done
      with 'usage', it becomes tricky when a task is already in a map, or if a
      reference should only be taken if a task is valid and will not soon be
      reaped. Ideally, we could do something like pass a reference to a map
      value, and then try to acquire a reference to the task in an RCU read
      region by using refcount_inc_not_zero().
      
      Similarly, in sched_ext, schedulers are using integer pids to remember
      tasks, and then looking them up with find_task_by_pid_ns(). This is
      slow, error prone, and adds complexity. It would be more convenient and
      performant if BPF schedulers could instead store tasks directly in maps,
      and then leverage RCU to ensure they can be safely accessed with low
      overhead.
      
      Finally, overloading fields like this is error prone. Someone that wants
      to use 'rcu_users' could easily overlook the fact that once the rcu
      callback is scheduled, the refcount will go back to being nonzero, thus
      precluding the use of refcount_inc_not_zero(). Furthermore, as described
      below, it's possible to extract the fields of the union without changing
      the size of task_struct.
      
      There are several possible ways to enable this:
      
      1. The lightest touch approach is likely the one proposed in this patch,
         which is to simply extract 'rcu_users' and 'rcu' from the union, so
         that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users'
         refcount. If we have a trusted task pointer, this would allow us to
         use refcount_inc_not_zero() inside of an RCU region to determine if we
         can safely acquire a reference to the task and store it in a map. As
         mentioned below, this can be done without changing the size of
         task_struct, by moving the location of the union to another location
         that has padding gaps we can fill in.
      
      2. Removing 'refcount_t rcu_users', and instead having the entire task
         be freed in an rcu callback. This is likely the most sound overall
         design, though it changes the behavioral semantics exposed to
         callers, who currently expect that a task that's successfully looked
         up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
         'usage' reference acquired on them, as it's guaranteed to be >
         0 until after the next gp. In order for this approach to work, we'd
         have to audit all callers. This approach also slightly changes
         behavior observed by user space by not invoking
         trace_sched_process_free() until the whole task_struct is actually being
         freed, rather than just after it's exited. It also may change
         timings, as memory will be freed in an RCU callback rather than
         immediately when the final 'usage' refcount drops to 0. This also is
         arguably a benefit, as it provides more predictable performance to
         callers who are refcounting tasks.
      
      3. There may be other solutions as well that don't require changing the
         layout of task_struct. For example, we could possibly do something
         complex from the BPF side, such as listen for task exit and remove a
         task from a map when the task is exiting. This would likely require
         significant custom handling for task_struct in the verifier, so a
         more generalizable solution is likely warranted.
      
      As mentioned above, this patch proposes the lightest-touch approach
      which allows callers elsewhere in the kernel to use 'rcu_users' to
      ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from
      the union. There is no size change in task_struct with this patch.
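      The "increment only if not already zero" pattern that this layout enables
      can be sketched in userspace with C11 atomics. This is an illustration
      under stated assumptions, not the kernel's refcount_t implementation:
      struct sketch_task and the sketch_* functions are hypothetical, and no
      RCU read-side protection is modelled.

      ```c
      #include <stdatomic.h>
      #include <stdbool.h>

      /*
       * Hypothetical task stand-in: the counter and the callback bookkeeping
       * are separate fields, so scheduling the callback cannot overwrite the
       * counter.
       */
      struct sketch_task {
          atomic_uint rcu_users;
          void *rcu_callback; /* stand-in for 'struct rcu_head rcu' */
      };

      /* Take a reference only if the count has not already dropped to zero. */
      bool sketch_inc_not_zero(atomic_uint *ref)
      {
          unsigned int old = atomic_load(ref);

          do {
              if (old == 0)
                  return false; /* object is already on its way out */
          } while (!atomic_compare_exchange_weak(ref, &old, old + 1));
          return true;
      }

      /* Drop a reference; returns true when this was the last one. */
      bool sketch_dec_and_test(atomic_uint *ref)
      {
          return atomic_fetch_sub(ref, 1) == 1;
      }
      ```

      Once the count reaches zero it stays zero, which is exactly the property
      that is lost when the counter shares storage with the callback head.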
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: David Vernet <void@manifault.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20230215233033.889644-1-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>