  1. 21 Jun, 2023 1 commit
    • bpf/btf: Accept function names that contain dots · 9724160b
      Florent Revest authored
      When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM
      leaves DWARF tags for the "asan.module_ctor" & co symbols. In turn,
      pahole creates BTF_KIND_FUNC entries for these and this makes the BTF
      metadata validation fail because they contain a dot.
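
      The relaxed rule can be illustrated with a minimal userspace sketch,
      assuming a validator that accepts letters, digits, '_' and '.', and
      rejects a leading digit. The names name_char_ok() and func_name_valid()
      are hypothetical; this is not the kernel's implementation.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: is c acceptable at this position of a name? */
static bool name_char_ok(char c, bool first)
{
	unsigned char uc = (unsigned char)c;

	/* Underscores and dots are accepted anywhere in this sketch. */
	if (c == '_' || c == '.')
		return true;
	/* The first character may be a letter; later ones may also be digits. */
	return first ? isalpha(uc) != 0 : isalnum(uc) != 0;
}

/* Hypothetical validator: non-empty, and every character acceptable. */
static bool func_name_valid(const char *name)
{
	size_t i;

	if (name == NULL || name[0] == '\0')
		return false;
	for (i = 0; name[i] != '\0'; i++) {
		if (!name_char_ok(name[i], i == 0))
			return false;
	}
	return true;
}
```

      With this rule, a name such as "asan.module_ctor" passes, while a name
      that starts with a digit or contains a space is still rejected.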
      
      In a dramatic turn of events, this BTF verification failure can cause
      the netfilter_bpf initialization to fail, causing netfilter_core to
      free the netfilter_helper hashmap and netfilter_ftp to trigger a
      use-after-free. The risk of u-a-f in netfilter will be addressed
      separately but the existence of "asan.module_ctor" debug info under some
      build conditions sounds like a good enough reason to accept functions
      that contain dots in BTF.
      
      Although using only LLVM=1 is the recommended way to compile clang-based
      kernels, users can certainly do LLVM=1, LLVM_IAS=0 as well and we still
      try to support that combination according to Nick. To clarify:
      
        - > v5.10 kernel, LLVM=1 (LLVM_IAS=0 is not the default) is recommended,
          but users can still set LLVM=1, LLVM_IAS=0, which triggers the issue
      
        - <= 5.10 kernel, LLVM=1 (LLVM_IAS=0 is the default) is recommended in
          which case GNU as will be used
      
      Fixes: 1dc92851 ("bpf: kernel side support for BTF Var and DataSec")
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Cc: Yonghong Song <yhs@meta.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Link: https://lore.kernel.org/bpf/20230615145607.3469985-1-revest@chromium.org
  2. 13 Jun, 2023 3 commits
    • Merge branch 'bpf: fix NULL dereference during extable search' · b78b34c6
      Alexei Starovoitov authored
      Krister Johansen says:
      
      ====================
      Hi,
      Enclosed are a pair of patches for an oops that can occur if an exception is
      generated while a bpf subprogram is running.  One of the bpf_prog_aux entries
      for the subprograms is missing an extable.  This can lead to an exception that
      would otherwise be handled turning into a NULL pointer bug.
      
      These changes were tested via the verifier and progs selftests and no
      regressions were observed.
      
      Changes from v4:
      - Ensure that num_exentries is copied to prog->aux from func[0] (Feedback from
        Ilya Leoshkevich)
      
      Changes from v3:
      - Selftest style fixups (Feedback from Yonghong Song)
      - Selftest needs to assert that test bpf program executed (Feedback from
        Yonghong Song)
      - Selftest should combine open and load using open_and_load (Feedback from
        Yonghong Song)
      
      Changes from v2:
      - Insert only the main program's kallsyms (Feedback from Yonghong Song and
        Alexei Starovoitov)
      - Selftest should use ASSERT instead of CHECK (Feedback from Yonghong Song)
      - Selftest needs some cleanup (Feedback from Yonghong Song)
      - Switch patch order (Feedback from Alexei Starovoitov)
      
      Changes from v1:
      - Add a selftest (Feedback From Alexei Starovoitov)
      - Move to a 1-line verifier change instead of searching multiple extables
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add a test for subprogram extables · 84a62b44
      Krister Johansen authored
      In certain situations a program with subprograms may have a NULL
      extable entry.  This should not happen, and when it does, it turns a
      single trap into multiple.  Add a test case for further debugging and to
      prevent regressions.
      
      The test-case contains three essentially identical versions of the same
      test because just one program may not be sufficient to trigger the oops.
      This is due to the fact that the items are stored in a binary tree and
      have identical values so it's possible to sometimes find the ksym with
      the extable.  With 3 copies, this has been reliable on this author's
      test systems.
      
      When triggered out of this test case, the oops looks like this:
      
         BUG: kernel NULL pointer dereference, address: 000000000000000c
         #PF: supervisor read access in kernel mode
         #PF: error_code(0x0000) - not-present page
         PGD 0 P4D 0
         Oops: 0000 [#1] PREEMPT SMP NOPTI
         CPU: 0 PID: 1132 Comm: test_progs Tainted: G           OE      6.4.0-rc3+ #2
         RIP: 0010:cmp_ex_search+0xb/0x30
         Code: cc cc cc cc e8 36 cb 03 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 55 48 89 e5 48 8b 07 <48> 63 0e 48 01 f1 31 d2 48 39 c8 19 d2 48 39 c8 b8 01 00 00 00 0f
         RSP: 0018:ffffb30c4291f998 EFLAGS: 00010006
         RAX: ffffffffc00b49da RBX: 0000000000000002 RCX: 000000000000000c
         RDX: 0000000000000002 RSI: 000000000000000c RDI: ffffb30c4291f9e8
         RBP: ffffb30c4291f998 R08: ffffffffab1a42d0 R09: 0000000000000001
         R10: 0000000000000000 R11: ffffffffab1a42d0 R12: ffffb30c4291f9e8
         R13: 000000000000000c R14: 000000000000000c R15: 0000000000000000
         FS:  00007fb5d9e044c0(0000) GS:ffff92e95ee00000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 000000000000000c CR3: 000000010c3a2005 CR4: 00000000007706f0
         DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         PKRU: 55555554
         Call Trace:
          <TASK>
          bsearch+0x41/0x90
          ? __pfx_cmp_ex_search+0x10/0x10
          ? bpf_prog_45a7907e7114d0ff_handle_fexit_ret_subprogs3+0x2a/0x6c
          search_extable+0x3b/0x60
          ? bpf_prog_45a7907e7114d0ff_handle_fexit_ret_subprogs3+0x2a/0x6c
          search_bpf_extables+0x10d/0x190
          ? bpf_prog_45a7907e7114d0ff_handle_fexit_ret_subprogs3+0x2a/0x6c
          search_exception_tables+0x5d/0x70
          fixup_exception+0x3f/0x5b0
          ? look_up_lock_class+0x61/0x110
          ? __lock_acquire+0x6b8/0x3560
          ? __lock_acquire+0x6b8/0x3560
          ? __lock_acquire+0x6b8/0x3560
          kernelmode_fixup_or_oops+0x46/0x110
          __bad_area_nosemaphore+0x68/0x2b0
          ? __lock_acquire+0x6b8/0x3560
          bad_area_nosemaphore+0x16/0x20
          do_kern_addr_fault+0x81/0xa0
          exc_page_fault+0xd6/0x210
          asm_exc_page_fault+0x2b/0x30
         RIP: 0010:bpf_prog_45a7907e7114d0ff_handle_fexit_ret_subprogs3+0x2a/0x6c
         Code: f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3 0f 1e fa 48 8b 7f 08 49 bb 00 00 00 00 00 80 00 00 4c 39 df 73 04 31 f6 eb 04 <48> 8b 77 00 49 bb 00 00 00 00 00 80 00 00 48 81 c7 7c 00 00 00 4c
         RSP: 0018:ffffb30c4291fcb8 EFLAGS: 00010282
         RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
         RDX: 00000000cddf1af1 RSI: 000000005315a00d RDI: ffffffffffffffea
         RBP: ffffb30c4291fcb8 R08: ffff92e644bf38a8 R09: 0000000000000000
         R10: 0000000000000000 R11: 0000800000000000 R12: ffff92e663652690
         R13: 00000000000001c8 R14: 00000000000001c8 R15: 0000000000000003
          bpf_trampoline_251255721842_2+0x63/0x1000
          bpf_testmod_return_ptr+0x9/0xb0 [bpf_testmod]
          ? bpf_testmod_test_read+0x43/0x2d0 [bpf_testmod]
          sysfs_kf_bin_read+0x60/0x90
          kernfs_fop_read_iter+0x143/0x250
          vfs_read+0x240/0x2a0
          ksys_read+0x70/0xe0
          __x64_sys_read+0x1f/0x30
          do_syscall_64+0x68/0xa0
          ? syscall_exit_to_user_mode+0x77/0x1f0
          ? do_syscall_64+0x77/0xa0
          ? irqentry_exit+0x35/0xa0
          ? sysvec_apic_timer_interrupt+0x4d/0x90
          entry_SYSCALL_64_after_hwframe+0x72/0xdc
         RIP: 0033:0x7fb5da00a392
         Code: ac 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
         RSP: 002b:00007ffc5b3cab68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
         RAX: ffffffffffffffda RBX: 000055bee7b8b100 RCX: 00007fb5da00a392
         RDX: 00000000000001c8 RSI: 0000000000000000 RDI: 0000000000000009
         RBP: 00007ffc5b3caba0 R08: 0000000000000000 R09: 0000000000000037
         R10: 000055bee7b8c2a7 R11: 0000000000000246 R12: 000055bee78f1f60
         R13: 00007ffc5b3cae90 R14: 0000000000000000 R15: 0000000000000000
          </TASK>
         Modules linked in: bpf_testmod(OE) nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common intel_uncore_frequency_common ppdev nfit crct10dif_pclmul crc32_pclmul psmouse ghash_clmulni_intel sha512_ssse3 aesni_intel parport_pc crypto_simd cryptd input_leds parport rapl ena i2c_piix4 mac_hid serio_raw ramoops reed_solomon pstore_blk drm pstore_zone efi_pstore autofs4 [last unloaded: bpf_testmod(OE)]
         CR2: 000000000000000c
      
      Though there may be some variation, depending on which subprogram
      triggers the bug.
      Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/4ebf95ec857cd785b81db69f3e408c039ad8408b.1686616663.git.kjlx@templeofstupid.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: ensure main program has an extable · 0108a4e9
      Krister Johansen authored
      When subprograms are in use, the main program is not jit'd after the
      subprograms because jit_subprogs sets a value for prog->bpf_func upon
      success.  Subsequent calls to the JIT are bypassed when this value is
      non-NULL.  This leads to a situation where the main program and its
      func[0] counterpart are both in the bpf kallsyms tree, but only func[0]
      has an extable.  Extables are only created during JIT.  Now there are
      two nearly identical program ksym entries in the tree, but only one has
      an extable.  Depending upon how the entries are placed, there's a chance
      that a fault will call search_extable on the aux with the NULL entry.
      
      Since jit_subprogs already copies state from func[0] to the main
      program, include the extable pointer in this state duplication.
      Additionally, ensure that the copy of the main program in func[0] is not
      added to the bpf_prog_kallsyms table. Instead, let the main program get
      added later in bpf_prog_load().  This ensures there is only a single
      copy of the main program in the kallsyms table, and that its tag matches
      the tag observed by tooling like bpftool.
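
      The state duplication described above can be sketched in userspace as
      follows. The structs are simplified stand-ins with hypothetical names
      (struct prog, struct prog_aux, copy_jit_state()), modelling only the
      fields discussed here; the real kernel structures and jit_subprogs()
      carry far more state.

```c
#include <stddef.h>

/* Simplified stand-in for an exception table entry. */
struct extable_entry {
	int insn;
	int fixup;
};

/* Simplified stand-in for the per-program auxiliary data. */
struct prog_aux {
	struct extable_entry *extable;
	unsigned int num_exentries;
};

/* Simplified stand-in for a program. */
struct prog {
	void *bpf_func;
	struct prog_aux aux;
};

/*
 * Mirror the JIT results of func[0] into the main program, including
 * the extable pointer and its entry count, so that a fault resolved
 * through either ksym entry finds a usable table.
 */
static void copy_jit_state(struct prog *main_prog, const struct prog *func0)
{
	main_prog->bpf_func = func0->bpf_func;
	main_prog->aux.extable = func0->aux.extable;
	main_prog->aux.num_exentries = func0->aux.num_exentries;
}
```

      Copying the entry count together with the pointer matters: a table
      pointer with a stale count of zero would still never match a faulting
      address.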
      
      Cc: stable@vger.kernel.org
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.1686616663.git.kjlx@templeofstupid.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. 12 Jun, 2023 1 commit
    • bpf: Fix a bpf_jit_dump issue for x86_64 with sysctl bpf_jit_enable. · ad96f1c9
      Yonghong Song authored
      The sysctl net/core/bpf_jit_enable does not work now due to commit
      1022a549 ("bpf, x86_64: Use bpf_jit_binary_pack_alloc"). The
      commit saved the jitted insns into 'rw_image' instead of 'image',
      which caused bpf_jit_dump to not dump the proper content.
      
      With 'echo 2 > /proc/sys/net/core/bpf_jit_enable', run
      './test_progs -t fentry_test'. Without this patch, the jitted
      image for one particular prog is:
      
        flen=17 proglen=92 pass=4 image=0000000014c64883 from=test_progs pid=1807
        00000000: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
        00000010: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
        00000020: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
        00000030: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
        00000040: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
        00000050: cc cc cc cc cc cc cc cc cc cc cc cc
      
      With this patch, the jitted image for the same prog is:
      
        flen=17 proglen=92 pass=4 image=00000000b90254b7 from=test_progs pid=1809
        00000000: f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3
        00000010: 0f 1e fa 31 f6 48 8b 57 00 48 83 fa 07 75 2b 48
        00000020: 8b 57 10 83 fa 09 75 22 48 8b 57 08 48 81 e2 ff
        00000030: 00 00 00 48 83 fa 08 75 11 48 8b 7f 18 be 01 00
        00000040: 00 00 48 83 ff 0a 74 02 31 f6 48 bf 18 d0 14 00
        00000050: 00 c9 ff ff 48 89 77 00 31 c0 c9 c3
      
      Fixes: 1022a549 ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/bpf/20230609005439.3173569-1-yhs@fb.com
  4. 08 Jun, 2023 8 commits
  5. 07 Jun, 2023 18 commits
    • bpf: Add extra path pointer check to d_path helper · f46fab0e
      Jiri Olsa authored
      Anastasios reported a crash on the stable 5.15 kernel with the following
      BPF program attached to an LSM hook:
      
        SEC("lsm.s/bprm_creds_for_exec")
        int BPF_PROG(bprm_creds_for_exec, struct linux_binprm *bprm)
        {
                struct path *path = &bprm->executable->f_path;
                char p[128] = { 0 };
      
                bpf_d_path(path, p, 128);
                return 0;
        }
      
      But bprm->executable can be NULL, so the bpf_d_path call will crash:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000018
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
        ...
        RIP: 0010:d_path+0x22/0x280
        ...
        Call Trace:
         <TASK>
         bpf_d_path+0x21/0x60
         bpf_prog_db9cf176e84498d9_bprm_creds_for_exec+0x94/0x99
         bpf_trampoline_6442506293_0+0x55/0x1000
         bpf_lsm_bprm_creds_for_exec+0x5/0x10
         security_bprm_creds_for_exec+0x29/0x40
         bprm_execve+0x1c1/0x900
         do_execveat_common.isra.0+0x1af/0x260
         __x64_sys_execve+0x32/0x40
      
      It's a problem for all stable trees with the bpf_d_path helper, which was
      added in 5.9.
      
      This issue is fixed in current bpf code, where we identify and mark
      trusted pointers, so the above code would fail even to load.
      
      For the sake of the stable trees and to work around a potentially broken
      verifier in the future, add code that reads the path object from
      the passed pointer and verifies it's valid in kernel space.
      
      Fixes: 6e22ab9d ("bpf: Add d_path helper")
      Reported-by: Anastasios Papagiannis <tasos.papagiannnis@gmail.com>
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20230606181714.532998-1-jolsa@kernel.org
    • net: sched: fix possible refcount leak in tc_chain_tmplt_add() · 44f8baaf
      Hangyu Hua authored
      try_module_get will be called in tcf_proto_lookup_ops. So module_put needs
      to be called to drop the refcount if ops don't implement the required
      function.
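
      The leak pattern and its fix can be modelled in userspace as below.
      This is only a sketch: struct module, struct proto_ops, lookup_ops()
      and chain_tmplt_add() are hypothetical stand-ins, and a plain integer
      replaces the kernel's module reference counting.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a module with a reference count. */
struct module {
	int refcnt;
};

/* Hypothetical classifier ops that may or may not support templates. */
struct proto_ops {
	struct module *owner;
	bool has_tmplt_ops;
};

/* Models a lookup that takes a module reference on success. */
static struct proto_ops *lookup_ops(struct proto_ops *ops)
{
	ops->owner->refcnt++;
	return ops;
}

static int chain_tmplt_add(struct proto_ops *candidate)
{
	struct proto_ops *ops = lookup_ops(candidate);

	if (!ops->has_tmplt_ops) {
		/* Drop the reference taken by the lookup before bailing out. */
		ops->owner->refcnt--;
		return -EOPNOTSUPP;
	}
	/* On success the reference stays held for the template's lifetime. */
	return 0;
}
```

      Without the decrement on the error path, every failed request would
      leave the module's count one higher, and the module could never be
      unloaded.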
      
      Fixes: 9f407f17 ("net: sched: introduce chain templates")
      Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
      Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: act_police: fix sparse errors in tcf_police_dump() · 682881ee
      Eric Dumazet authored
      Fixes the following sparse errors:
      
      net/sched/act_police.c:360:28: warning: dereference of noderef expression
      net/sched/act_police.c:362:45: warning: dereference of noderef expression
      net/sched/act_police.c:362:45: warning: dereference of noderef expression
      net/sched/act_police.c:368:28: warning: dereference of noderef expression
      net/sched/act_police.c:370:45: warning: dereference of noderef expression
      net/sched/act_police.c:370:45: warning: dereference of noderef expression
      net/sched/act_police.c:376:45: warning: dereference of noderef expression
      net/sched/act_police.c:376:45: warning: dereference of noderef expression
      
      Fixes: d1967e49 ("net_sched: act_police: add 2 new attributes to support police 64bit rate and peakrate")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: openvswitch: fix upcall counter access before allocation · de9df6c6
      Eelco Chaudron authored
      Currently, the per cpu upcall counters are allocated after the vport is
      created and inserted into the system. This could lead to the datapath
      accessing the counters before they are allocated resulting in a kernel
      Oops.
      
      Here is an example:
      
        PID: 59693    TASK: ffff0005f4f51500  CPU: 0    COMMAND: "ovs-vswitchd"
         #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4
         #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc
         #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60
         #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58
         #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388
         #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c
         #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68
         #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch]
         ...
      
        PID: 58682    TASK: ffff0005b2f0bf00  CPU: 0    COMMAND: "kworker/0:3"
         #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758
         #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994
         #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8
         #3 [ffff80000a5d3120] die at ffffb70f0628234c
         #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8
         #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4
         #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4
         #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710
         #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74
         #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac
        #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24
        #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc
        #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch]
        #13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch]
        #14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch]
        #15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch]
        #16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch]
        #17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90
      
      We moved the per cpu upcall counter allocation to the existing vport
      alloc and free functions to solve this.
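
      The ordering fix can be sketched in userspace as follows, assuming
      hypothetical vport_alloc() and vport_free() helpers and using calloc()
      in place of the kernel's per cpu allocator. The point is only that the
      counters exist before the object can become visible to anyone else.

```c
#include <stdlib.h>

/* Simplified stand-in for the upcall counters. */
struct upcall_stats {
	unsigned long n_success;
	unsigned long n_fail;
};

/* Simplified stand-in for a vport. */
struct vport {
	struct upcall_stats *upcall_stats;
};

/*
 * Allocate the counters as part of vport allocation, so a vport is
 * never handed out (or inserted anywhere) without them.
 */
static struct vport *vport_alloc(void)
{
	struct vport *vport = calloc(1, sizeof(*vport));

	if (!vport)
		return NULL;
	vport->upcall_stats = calloc(1, sizeof(*vport->upcall_stats));
	if (!vport->upcall_stats) {
		free(vport);
		return NULL;
	}
	return vport;
}

/* Free the counters together with the vport. */
static void vport_free(struct vport *vport)
{
	if (!vport)
		return;
	free(vport->upcall_stats);
	free(vport);
}
```

      Tying both allocations to the same pair of functions also keeps the
      error paths symmetric: whatever vport_alloc() set up, vport_free()
      tears down.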
      
      Fixes: 95637d91 ("net: openvswitch: release vport resources on failure")
      Fixes: 1933ea36 ("net: openvswitch: Add support to count upcall packets")
      Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Acked-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: move rtm_tca_policy declaration to include file · 886bc7d6
      Eric Dumazet authored
      rtm_tca_policy is used from net/sched/sch_api.c and net/sched/cls_api.c,
      thus should be declared in an include file.
      
      This fixes the following sparse warning:
      net/sched/sch_api.c:1434:25: warning: symbol 'rtm_tca_policy' was not declared. Should it be static?
      
      Fixes: e331473f ("net/sched: cls_api: add missing validation of netlink attributes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ice: make writes to /dev/gnssX synchronous · bf15bb38
      Michal Schmidt authored
      The current ice driver's GNSS write implementation buffers writes and
      works through them asynchronously in a kthread. That's bad because:
       - The GNSS write_raw operation is supposed to be synchronous[1][2].
       - There is no upper bound on the number of pending writes.
         Userspace can submit writes much faster than the driver can process,
         consuming unlimited amounts of kernel memory.
      
      A patch that's currently under review[3] ("[v3,net] ice: Write all GNSS
      buffers instead of first one") would add one more problem:
       - The possibility of waiting for a very long time to flush the write
         work when doing rmmod, leading to softlockups.
      
      To fix these issues, simplify the implementation: Drop the buffering,
      the write_work, and make the writes synchronous.
      
      I tested this with gpsd and ubxtool.
      
      [1] https://events19.linuxfoundation.org/wp-content/uploads/2017/12/The-GNSS-Subsystem-Johan-Hovold-Hovold-Consulting-AB.pdf
          "User interface" slide.
      [2] A comment in drivers/gnss/core.c:gnss_write():
              /* Ignoring O_NONBLOCK, write_raw() is synchronous. */
      [3] https://patchwork.ozlabs.org/project/intel-wired-lan/patch/20230217120541.16745-1-karol.kolacinski@intel.com/
      
      Fixes: d6b98c8d ("ice: add write functionality for GNSS TTY")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: add rcu annotations around qdisc->qdisc_sleeping · d636fc5d
      Eric Dumazet authored
      syzbot reported a race around qdisc->qdisc_sleeping [1]
      
      It is time we add proper annotations to reads and writes to/from
      qdisc->qdisc_sleeping.
      
      [1]
      BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu
      
      read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1:
      qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331
      __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174
      tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547
      rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386
      netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
      rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
      netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
      netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
      netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
      sock_sendmsg_nosec net/socket.c:724 [inline]
      sock_sendmsg net/socket.c:747 [inline]
      ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
      ___sys_sendmsg net/socket.c:2557 [inline]
      __sys_sendmsg+0x1e3/0x270 net/socket.c:2586
      __do_sys_sendmsg net/socket.c:2595 [inline]
      __se_sys_sendmsg net/socket.c:2593 [inline]
      __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0:
      dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115
      qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103
      tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693
      rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395
      netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
      rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
      netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
      netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
      netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
      sock_sendmsg_nosec net/socket.c:724 [inline]
      sock_sendmsg net/socket.c:747 [inline]
      ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
      ___sys_sendmsg net/socket.c:2557 [inline]
      __sys_sendmsg+0x1e3/0x270 net/socket.c:2586
      __do_sys_sendmsg net/socket.c:2595 [inline]
      __se_sys_sendmsg net/socket.c:2593 [inline]
      __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023
      
      Fixes: 3a7d0d07 ("net: sched: extend Qdisc with rcu")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Buslov <vladbu@nvidia.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'rfs-lockless-annotate' · e3144ff5
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      rfs: annotate lockless accesses
      
      rfs runs without locks held, so we should annotate
      reads and writes to shared variables.
      
      This should prevent compilers from forcing writes
      in the following situation:
      
        if (var != val)
           var = val;
      
      A compiler could indeed simply avoid the conditional:
      
          var = val;
      
      This matters if var is shared between many cpus.
      
      v2: aligns one closing bracket (Simon)
          adds Fixes: tags (Jakub)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rfs: annotate lockless accesses to RFS sock flow table · 5c3b74a9
      Eric Dumazet authored
      Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table.
      
      This also prevents a (smart ?) compiler from removing the condition in:
      
      if (table->ents[index] != newval)
              table->ents[index] = newval;
      
      We need the condition to avoid dirtying a shared cache line.
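
      A userspace sketch of the annotated pattern follows. The READ_ONCE()
      and WRITE_ONCE() definitions here are simplified approximations built
      on volatile accesses and the GNU __typeof__ extension, not the kernel's
      macros, and update_entry() is a hypothetical helper.

```c
/* Simplified approximations of the kernel macros (assumption). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/*
 * Store only when the value actually changes, so an unchanged entry
 * does not dirty a shared cache line. Returns 1 if a store was issued
 * and 0 otherwise.
 */
static int update_entry(unsigned int *ents, unsigned int index,
			unsigned int newval)
{
	if (READ_ONCE(ents[index]) != newval) {
		WRITE_ONCE(ents[index], newval);
		return 1;
	}
	return 0;
}
```

      Because both accesses go through volatile lvalues, the compiler has to
      perform the load and cannot fold the conditional into an unconditional
      store.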
      
      Fixes: fec5e652 ("rfs: Receive Flow Steering")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rfs: annotate lockless accesses to sk->sk_rxhash · 1e5c647c
      Eric Dumazet authored
      Add READ_ONCE()/WRITE_ONCE() on accesses to sk->sk_rxhash.
      
      This also prevents a (smart ?) compiler from removing the condition in:
      
      if (sk->sk_rxhash != newval)
      	sk->sk_rxhash = newval;
      
      We need the condition to avoid dirtying a shared cache line.
      
      Fixes: fec5e652 ("rfs: Receive Flow Steering")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'for-net-2023-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · ab39b113
      Jakub Kicinski authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
       - Fixes to debugfs registration
       - Fix use-after-free in hci_remove_ltk/hci_remove_irk
       - Fixes to ISO channel support
       - Fix missing checks for invalid L2CAP DCID
       - Fix l2cap_disconnect_req deadlock
       - Add lock to protect HCI_UNREGISTER
      
      * tag 'for-net-2023-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
        Bluetooth: L2CAP: Add missing checks for invalid DCID
        Bluetooth: ISO: use correct CIS order in Set CIG Parameters event
        Bluetooth: ISO: don't try to remove CIG if there are bound CIS left
        Bluetooth: Fix l2cap_disconnect_req deadlock
        Bluetooth: hci_qca: fix debugfs registration
        Bluetooth: fix debugfs registration
        Bluetooth: hci_sync: add lock to protect HCI_UNREGISTER
        Bluetooth: Fix use-after-free in hci_remove_ltk/hci_remove_irk
        Bluetooth: ISO: Fix CIG auto-allocation to select configurable CIG
        Bluetooth: ISO: consider right CIS when removing CIG at cleanup
      ====================
      
      Link: https://lore.kernel.org/r/20230606003454.2392552-1-luiz.dentz@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'nf-23-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 20c47646
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Missing NULL check in basechain hook netlink dump path, from Gavrilov Ilia.
      
      2) Fix bitwise register tracking, from Jeremy Sowden.
      
      3) Null pointer dereference when accessing conntrack helper,
         from Tijs Van Buggenhout.
      
      4) Add schedule point to ipset's call_ad, from Kuniyuki Iwashima.
      
      5) Incorrect boundary check when building chain blob.
      
      * tag 'nf-23-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: out-of-bound check in chain blob
        netfilter: ipset: Add schedule point in call_ad().
        netfilter: conntrack: fix NULL pointer dereference in nf_confirm_cthelper
        netfilter: nft_bitwise: fix register tracking
        netfilter: nf_tables: Add null check for nla_nest_start_noflag() in nft_dump_basechain_hook()
      ====================
      
      Link: https://lore.kernel.org/r/20230606225851.67394-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      20c47646
    • Jakub Kicinski's avatar
      Merge tag 'wireless-2023-06-06' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless · e684ab76
      Jakub Kicinski authored
      Kalle Valo says:
      
      ====================
      wireless fixes for v6.4
      
      Both rtw88 and rtw89 have a 802.11 powersave fix for a regression
      introduced in v6.0. mt76 fixes a race and a null pointer dereference.
      iwlwifi fixes an issue where not enough memory was allocated for a
      firmware event. And finally the stack has several smaller fixes all
      over.
      
      * tag 'wireless-2023-06-06' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
        wifi: cfg80211: fix locking in regulatory disconnect
        wifi: cfg80211: fix locking in sched scan stop work
        wifi: iwlwifi: mvm: Fix -Warray-bounds bug in iwl_mvm_wait_d3_notif()
        wifi: mac80211: fix switch count in EMA beacons
        wifi: mac80211: don't translate beacon/presp addrs
        wifi: mac80211: mlme: fix non-inheritence element
        wifi: cfg80211: reject bad AP MLD address
        wifi: mac80211: use correct iftype HE cap
        wifi: mt76: mt7996: fix possible NULL pointer dereference in mt7996_mac_write_txwi()
        wifi: rtw89: remove redundant check of entering LPS
        wifi: rtw89: correct PS calculation for SUPPORTS_DYNAMIC_PS
        wifi: rtw88: correct PS calculation for SUPPORTS_DYNAMIC_PS
        wifi: mt76: mt7615: fix possible race in mt7615_mac_sta_poll
      ====================
      
      Link: https://lore.kernel.org/r/20230606150817.EC133C433D2@smtp.kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e684ab76
    • Brett Creeley's avatar
      virtio_net: use control_buf for coalesce params · accc1bf2
      Brett Creeley authored
      Commit 699b045a ("net: virtio_net: notifications coalescing
      support") added coalescing command support for virtio_net. However,
      the coalesce commands are using buffers on the stack, which is causing
      the device to see DMA errors. There should also be a complaint from
      check_for_stack() in debug_dma_map_xyz(). Fix this by adding and using
      coalesce params from the control_buf struct, which aligns with other
      commands.
      
      Cc: stable@vger.kernel.org
      Fixes: 699b045a ("net: virtio_net: notifications coalescing support")
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
      Signed-off-by: Brett Creeley <brett.creeley@amd.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Link: https://lore.kernel.org/r/20230605195925.51625-1-brett.creeley@amd.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      accc1bf2
    • Brett Creeley's avatar
      pds_core: Fix FW recovery detection · 4f48c303
      Brett Creeley authored
      Commit 523847df ("pds_core: add devcmd device interfaces") included
      initial support for FW recovery detection. Unfortunately, the ordering
      in pdsc_is_fw_good() was incorrect, which was causing FW recovery to be
      undetected by the driver. Fix this by making sure to update the cached
      fw_status by calling pdsc_is_fw_running() before setting the local FW
      gen.
      
      Fixes: 523847df ("pds_core: add devcmd device interfaces")
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Brett Creeley <brett.creeley@amd.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230605195116.49653-1-brett.creeley@amd.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4f48c303
    • Eric Dumazet's avatar
      tcp: gso: really support BIG TCP · 82a01ab3
      Eric Dumazet authored
      We missed that tcp_gso_segment() was assuming skb->len was smaller than 65535 :
      
      oldlen = (u16)~skb->len;
      
      This part came with commit 0718bcc0 ("[NET]: Fix CHECKSUM_HW GSO problems.")
      
      This leads to wrong TCP checksum.
      
      Adapt the code to accept arbitrary packet length.
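
The truncation described above can be illustrated with a small standalone sketch. This is not the kernel code or the actual fix; the helper names `oldlen_u16` and `oldlen_collides` are hypothetical and exist only to show that casting the complemented length to 16 bits discards everything above bit 15, so two different lengths can yield the same value:

```c
#include <stdint.h>

/* Mirrors the quoted expression: complement the length, then keep
 * only the low 16 bits. (Hypothetical helper, not a kernel function.) */
uint16_t oldlen_u16(uint32_t len)
{
	return (uint16_t)~len;
}

/* Returns 1 when two lengths become indistinguishable after the
 * 16-bit truncation, 0 otherwise. */
int oldlen_collides(uint32_t a, uint32_t b)
{
	return oldlen_u16(a) == oldlen_u16(b);
}
```

For example, a 70000-byte length and a 4464-byte length (70000 - 65536) collide under this expression, which is the kind of information loss that would feed a wrong value into the checksum adjustment.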
      
      v2:
        - use two csum_add() instead of csum_fold() (Alexander Duyck)
        - Change delta type to __wsum to reduce casts (Alexander Duyck)
      
      Fixes: 09f3d1a3 ("ipv6/gso: remove temporary HBH/jumbo header")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230605161647.3624428-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      82a01ab3
    • Kuniyuki Iwashima's avatar
      ipv6: rpl: Fix Route of Death. · a2f4c143
      Kuniyuki Iwashima authored
      A remote DoS vulnerability of RPL Source Routing is assigned CVE-2023-2156.
      
      The Source Routing Header (SRH) has the following format:
      
        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |  Next Header  |  Hdr Ext Len  | Routing Type  | Segments Left |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        | CmprI | CmprE |  Pad  |               Reserved                |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        |                                                               |
        .                                                               .
        .                        Addresses[1..n]                        .
        .                                                               .
        |                                                               |
        +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      
      The originator of an SRH places the first hop's IPv6 address in the IPv6
      header's IPv6 Destination Address and the second hop's IPv6 address as
      the first address in Addresses[1..n].
      
      The CmprI and CmprE fields indicate the number of prefix octets that are
      shared with the IPv6 Destination Address.  When CmprI or CmprE is not 0,
      Addresses[1..n] are compressed as follows:
      
        1..n-1 : (16 - CmprI) bytes
             n : (16 - CmprE) bytes
      
      Segments Left indicates the number of route segments remaining.  When the
      value is not zero, the SRH is forwarded to the next hop.  Its address
      is extracted from Addresses[n - Segment Left + 1] and swapped with IPv6
      Destination Address.
      
      When Segment Left is greater than or equal to 2, the size of SRH is not
      changed because Addresses[1..n-1] are decompressed and recompressed with
      CmprI.
      
      OTOH, when Segment Left changes from 1 to 0, the new SRH could have a
      different size because Addresses[1..n-1] are decompressed with CmprI and
      recompressed with CmprE.
      
      Let's say CmprI is 15 and CmprE is 0.  When we receive SRH with Segment
      Left >= 2, Addresses[1..n-1] have 1 byte for each, and Addresses[n] has
      16 bytes.  When Segment Left is 1, Addresses[1..n-1] is decompressed to
      16 bytes and not recompressed.  Finally, the new SRH will need more room
      in the header, and the size is (16 - 1) * (n - 1) bytes.
      
      Here the max value of n is 255 as Segment Left is u8, so in the worst case,
      we have to allocate 3825 bytes in the skb headroom.  However, now we only
      allocate a small fixed buffer that is IPV6_RPL_SRH_WORST_SWAP_SIZE (16 + 7
      bytes).  If the decompressed size overflows the room, skb_push() hits BUG()
      below [0].
      
      Instead of allocating the fixed buffer for every packet, let's allocate
      enough headroom only when we receive SRH with Segment Left 1.
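
The arithmetic in this message can be sketched as follows. This is a minimal illustration of the formulas stated above, not the kernel implementation; the function names and the `FIXED_SWAP_ROOM` macro are hypothetical, and the sketch assumes n >= 1 and CmprI >= CmprE:

```c
#include <stddef.h>

/* The fixed room named in the message: 16 + 7 bytes. */
#define FIXED_SWAP_ROOM (16 + 7)

/* Bytes occupied by Addresses[1..n]: the first n-1 entries are
 * compressed with CmprI, the last one with CmprE. */
size_t srh_addr_bytes(size_t n, unsigned cmpri, unsigned cmpre)
{
	return (n - 1) * (16 - cmpri) + (16 - cmpre);
}

/* Extra bytes needed when Addresses[1..n-1] are decompressed with
 * CmprI and recompressed with CmprE (Segments Left going 1 -> 0). */
size_t srh_growth(size_t n, unsigned cmpri, unsigned cmpre)
{
	return (n - 1) * (cmpri - cmpre);
}

/* Returns 1 when the growth no longer fits in the fixed room. */
int srh_overflows_fixed_room(size_t n, unsigned cmpri, unsigned cmpre)
{
	return srh_growth(n, cmpri, cmpre) > FIXED_SWAP_ROOM;
}
```

With CmprI = 15 and CmprE = 0, two addresses grow the header by 15 bytes, which still fits in 23 bytes, while three addresses already need 30 bytes and would not fit.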
      
      [0]:
      skbuff: skb_under_panic: text:ffffffff81c9f6e2 len:576 put:576 head:ffff8880070b5180 data:ffff8880070b4fb0 tail:0x70 end:0x140 dev:lo
      kernel BUG at net/core/skbuff.c:200!
      invalid opcode: 0000 [#1] PREEMPT SMP PTI
      CPU: 0 PID: 154 Comm: python3 Not tainted 6.4.0-rc4-00190-gc308e9ec #7
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:skb_panic (net/core/skbuff.c:200)
      Code: 4f 70 50 8b 87 bc 00 00 00 50 8b 87 b8 00 00 00 50 ff b7 c8 00 00 00 4c 8b 8f c0 00 00 00 48 c7 c7 80 6e 77 82 e8 ad 8b 60 ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
      RSP: 0018:ffffc90000003da0 EFLAGS: 00000246
      RAX: 0000000000000085 RBX: ffff8880058a6600 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88807dc1c540 RDI: ffff88807dc1c540
      RBP: ffffc90000003e48 R08: ffffffff82b392c8 R09: 00000000ffffdfff
      R10: ffffffff82a592e0 R11: ffffffff82b092e0 R12: ffff888005b1c800
      R13: ffff8880070b51b8 R14: ffff888005b1ca18 R15: ffff8880070b5190
      FS:  00007f4539f0b740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055670baf3000 CR3: 0000000005b0e000 CR4: 00000000007506f0
      PKRU: 55555554
      Call Trace:
       <IRQ>
       skb_push (net/core/skbuff.c:210)
       ipv6_rthdr_rcv (./include/linux/skbuff.h:2880 net/ipv6/exthdrs.c:634 net/ipv6/exthdrs.c:718)
       ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:437 (discriminator 5))
       ip6_input_finish (./include/linux/rcupdate.h:805 net/ipv6/ip6_input.c:483)
       __netif_receive_skb_one_core (net/core/dev.c:5494)
       process_backlog (./include/linux/rcupdate.h:805 net/core/dev.c:5934)
       __napi_poll (net/core/dev.c:6496)
       net_rx_action (net/core/dev.c:6565 net/core/dev.c:6696)
       __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:572)
       do_softirq (kernel/softirq.c:472 kernel/softirq.c:459)
       </IRQ>
       <TASK>
       __local_bh_enable_ip (kernel/softirq.c:396)
       __dev_queue_xmit (net/core/dev.c:4272)
       ip6_finish_output2 (./include/net/neighbour.h:544 net/ipv6/ip6_output.c:134)
       rawv6_sendmsg (./include/net/dst.h:458 ./include/linux/netfilter.h:303 net/ipv6/raw.c:656 net/ipv6/raw.c:914)
       sock_sendmsg (net/socket.c:724 net/socket.c:747)
       __sys_sendto (net/socket.c:2144)
       __x64_sys_sendto (net/socket.c:2156 net/socket.c:2152 net/socket.c:2152)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      RIP: 0033:0x7f453a138aea
      Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      RSP: 002b:00007ffcc212a1c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007ffcc212a288 RCX: 00007f453a138aea
      RDX: 0000000000000060 RSI: 00007f4539084c20 RDI: 0000000000000003
      RBP: 00007f4538308e80 R08: 00007ffcc212a300 R09: 000000000000001c
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f4539712d1b
       </TASK>
      Modules linked in:
      
      Fixes: 8610c7c6 ("net: ipv6: add support for rpl sr exthdr")
      Reported-by: Max VA
      Closes: https://www.interruptlabs.co.uk/articles/linux-ipv6-route-of-death
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20230605180617.67284-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a2f4c143
    • Jakub Kicinski's avatar
      netlink: specs: ethtool: fix random typos · f6ca5baf
      Jakub Kicinski authored
      Working on the code gen for C reveals typos in the ethtool spec
      as the compiler tries to find the names in the existing uAPI
      header. Fix the mistakes.
      
      Fixes: a353318e ("tools: ynl: populate most of the ethtool spec")
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20230605233257.843977-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f6ca5baf
  6. 06 Jun, 2023 9 commits
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: out-of-bound check in chain blob · 08e42a0d
      Pablo Neira Ayuso authored
      Add current size of rule expressions to the boundary check.
      
      Fixes: 2c865a8a ("netfilter: nf_tables: add rule blob layout")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      08e42a0d
    • Kuniyuki Iwashima's avatar
      netfilter: ipset: Add schedule point in call_ad(). · 24e22789
      Kuniyuki Iwashima authored
      syzkaller found a repro that causes Hung Task [0] with ipset.  The repro
      first creates an ipset and then tries to delete a large number of IPs
      from the ipset concurrently:
      
        IPSET_ATTR_IPADDR_IPV4 : 172.20.20.187
        IPSET_ATTR_CIDR        : 2
      
      The first deleting thread hogs a CPU with nfnl_lock(NFNL_SUBSYS_IPSET)
      held, and other threads wait for it to be released.
      
      Previously, the same issue existed in set->variant->uadt() that could run
      so long under ip_set_lock(set).  Commit 5e29dc36 ("netfilter: ipset:
      Rework long task execution when adding/deleting entries") tried to fix it,
      but the issue still exists in the caller with another mutex.
      
      While adding/deleting many IPs, we should release the CPU periodically to
      prevent someone from abusing ipset to hang the system.
      
      Note we need to increment the ipset's refcnt to prevent the ipset from
      being destroyed while rescheduling.
      
      [0]:
      INFO: task syz-executor174:268 blocked for more than 143 seconds.
            Not tainted 6.4.0-rc1-00145-gba79e9a7 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:syz-executor174 state:D stack:0     pid:268   ppid:260    flags:0x0000000d
      Call trace:
       __switch_to+0x308/0x714 arch/arm64/kernel/process.c:556
       context_switch kernel/sched/core.c:5343 [inline]
       __schedule+0xd84/0x1648 kernel/sched/core.c:6669
       schedule+0xf0/0x214 kernel/sched/core.c:6745
       schedule_preempt_disabled+0x58/0xf0 kernel/sched/core.c:6804
       __mutex_lock_common kernel/locking/mutex.c:679 [inline]
       __mutex_lock+0x6fc/0xdb0 kernel/locking/mutex.c:747
       __mutex_lock_slowpath+0x14/0x20 kernel/locking/mutex.c:1035
       mutex_lock+0x98/0xf0 kernel/locking/mutex.c:286
       nfnl_lock net/netfilter/nfnetlink.c:98 [inline]
       nfnetlink_rcv_msg+0x480/0x70c net/netfilter/nfnetlink.c:295
       netlink_rcv_skb+0x1c0/0x350 net/netlink/af_netlink.c:2546
       nfnetlink_rcv+0x18c/0x199c net/netfilter/nfnetlink.c:658
       netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
       netlink_unicast+0x664/0x8cc net/netlink/af_netlink.c:1365
       netlink_sendmsg+0x6d0/0xa4c net/netlink/af_netlink.c:1913
       sock_sendmsg_nosec net/socket.c:724 [inline]
       sock_sendmsg net/socket.c:747 [inline]
       ____sys_sendmsg+0x4b8/0x810 net/socket.c:2503
       ___sys_sendmsg net/socket.c:2557 [inline]
       __sys_sendmsg+0x1f8/0x2a4 net/socket.c:2586
       __do_sys_sendmsg net/socket.c:2595 [inline]
       __se_sys_sendmsg net/socket.c:2593 [inline]
       __arm64_sys_sendmsg+0x80/0x94 net/socket.c:2593
       __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
       invoke_syscall+0x84/0x270 arch/arm64/kernel/syscall.c:52
       el0_svc_common+0x134/0x24c arch/arm64/kernel/syscall.c:142
       do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:193
       el0_svc+0x2c/0x7c arch/arm64/kernel/entry-common.c:637
       el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:655
       el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Fixes: a7b4f989 ("netfilter: ipset: IP set core support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      24e22789
    • Tijs Van Buggenhout's avatar
      netfilter: conntrack: fix NULL pointer dereference in nf_confirm_cthelper · e1f543dc
      Tijs Van Buggenhout authored
      An nf_conntrack_helper from nf_conn_help may become NULL after DNAT.
      
      Observed when TCP port 1720 (Q931_PORT), associated with h323 conntrack
      helper, is DNAT'ed to another destination port (e.g. 1730), while
      nfqueue is being used for final acceptance (e.g. snort).
      
      This happened after the transition from kernel 4.14 to 5.10.161.
      
      Workarounds:
       * keep the same port (1720) in DNAT
       * disable nfqueue
       * disable/unload h323 NAT helper
      
      $ linux-5.10/scripts/decode_stacktrace.sh vmlinux < /tmp/kernel.log
      BUG: kernel NULL pointer dereference, address: 0000000000000084
      [..]
      RIP: 0010:nf_conntrack_update (net/netfilter/nf_conntrack_core.c:2080 net/netfilter/nf_conntrack_core.c:2134) nf_conntrack
      [..]
      nfqnl_reinject (net/netfilter/nfnetlink_queue.c:237) nfnetlink_queue
      nfqnl_recv_verdict (net/netfilter/nfnetlink_queue.c:1230) nfnetlink_queue
      nfnetlink_rcv_msg (net/netfilter/nfnetlink.c:241) nfnetlink
      [..]
      
      Fixes: ee04805f ("netfilter: conntrack: make conntrack userspace helpers work again")
      Signed-off-by: Tijs Van Buggenhout <tijs.van.buggenhout@axsguard.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      e1f543dc
    • Jeremy Sowden's avatar
      netfilter: nft_bitwise: fix register tracking · 14e8b293
      Jeremy Sowden authored
      At the end of `nft_bitwise_reduce`, there is a loop which is intended to
      update the bitwise expression associated with each tracked destination
      register.  However, currently, it just updates the first register
      repeatedly.  Fix it.
      
      Fixes: 34cc9e52 ("netfilter: nf_tables: cancel tracking for clobbered destination registers")
      Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      14e8b293
    • Gavrilov Ilia's avatar
      netfilter: nf_tables: Add null check for nla_nest_start_noflag() in nft_dump_basechain_hook() · bd058763
      Gavrilov Ilia authored
      The nla_nest_start_noflag() function may fail and return NULL;
      the return value needs to be checked.
      
      Found by InfoTeCS on behalf of Linux Verification Center
      (linuxtesting.org) with SVACE.
      
      Fixes: d54725cd ("netfilter: nf_tables: support for multiple devices per netdev hook")
      Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      bd058763
    • Yonghong Song's avatar
      selftests/bpf: Fix sockopt_sk selftest · 69844e33
      Yonghong Song authored
      Commit f4e45348 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
      fixed NETLINK_LIST_MEMBERSHIPS length report which caused
      selftest sockopt_sk failure. The failure log looks like
      
        test_sockopt_sk:PASS:join_cgroup /sockopt_sk 0 nsec
        run_test:PASS:skel_load 0 nsec
        run_test:PASS:setsockopt_link 0 nsec
        run_test:PASS:getsockopt_link 0 nsec
        getsetsockopt:FAIL:Unexpected NETLINK_LIST_MEMBERSHIPS value unexpected Unexpected NETLINK_LIST_MEMBERSHIPS value: actual 8 != expected 4
        run_test:PASS:getsetsockopt 0 nsec
        #201     sockopt_sk:FAIL
      
      In net/netlink/af_netlink.c, function netlink_getsockopt(), for NETLINK_LIST_MEMBERSHIPS,
      nlk->ngroups equals 36. Before commit f4e45348, the optlen was calculated as
        ALIGN(nlk->ngroups / 8, sizeof(u32)) = 4
      After that commit, the optlen is
        ALIGN(BITS_TO_BYTES(nlk->ngroups), sizeof(u32)) = 8
      
      Fix the test by setting the expected optlen to be 8.
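
The two optlen calculations above can be reproduced with a short standalone sketch. The macros `ALIGN_UP` and `BITS_TO_BYTES_` below are local stand-ins written for this illustration (assumed to behave like the kernel's ALIGN() and BITS_TO_BYTES() for these inputs), and the function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the kernel macros; 'a' must be a power of two. */
#define ALIGN_UP(x, a)    (((x) + (a) - 1) & ~((size_t)(a) - 1))
#define BITS_TO_BYTES_(n) (((n) + 7) / 8)

/* Old calculation: integer division drops the partial byte. */
size_t optlen_before(size_t ngroups)
{
	return ALIGN_UP(ngroups / 8, sizeof(uint32_t));
}

/* New calculation: round the bit count up to whole bytes first. */
size_t optlen_after(size_t ngroups)
{
	return ALIGN_UP(BITS_TO_BYTES_(ngroups), sizeof(uint32_t));
}
```

For ngroups = 36 the old form gives 36 / 8 = 4 bytes, already aligned, while the new form rounds up to 5 bytes and then aligns to 8, which matches the "actual 8 != expected 4" line in the failure log. For a multiple of 32 such as ngroups = 32, both forms agree.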
      
      Fixes: f4e45348 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230606172202.1606249-1-yhs@fb.com
      69844e33
    • Johannes Berg's avatar
      wifi: cfg80211: fix locking in regulatory disconnect · f7e60032
      Johannes Berg authored
      This should use wiphy_lock() now instead of requiring the
      RTNL, since __cfg80211_leave() via cfg80211_leave() is now
      requiring that lock to be held.
      
      Fixes: a05829a7 ("cfg80211: avoid holding the RTNL when calling the driver")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      f7e60032
    • Johannes Berg's avatar
      wifi: cfg80211: fix locking in sched scan stop work · 3e54ed82
      Johannes Berg authored
      This should use wiphy_lock() now instead of acquiring the
      RTNL, since cfg80211_stop_sched_scan_req() now needs that.
      
      Fixes: a05829a7 ("cfg80211: avoid holding the RTNL when calling the driver")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      3e54ed82
    • Manish Chopra's avatar
      qed/qede: Fix scheduling while atomic · 42510dff
      Manish Chopra authored
      Reading statistics through a bond interface via sysfs causes the bug
      and trace below: the bonding module collects the slave device
      statistics while holding a spinlock, and beneath that the qede->qed
      driver statistics flow gets scheduled out due to the usleep_range()
      used in the PTT acquire logic.
      
      [ 3673.988874] Hardware name: HPE ProLiant DL365 Gen10 Plus/ProLiant DL365 Gen10 Plus, BIOS A42 10/29/2021
      [ 3673.988878] Call Trace:
      [ 3673.988891]  dump_stack_lvl+0x34/0x44
      [ 3673.988908]  __schedule_bug.cold+0x47/0x53
      [ 3673.988918]  __schedule+0x3fb/0x560
      [ 3673.988929]  schedule+0x43/0xb0
      [ 3673.988932]  schedule_hrtimeout_range_clock+0xbf/0x1b0
      [ 3673.988937]  ? __hrtimer_init+0xc0/0xc0
      [ 3673.988950]  usleep_range+0x5e/0x80
      [ 3673.988955]  qed_ptt_acquire+0x2b/0xd0 [qed]
      [ 3673.988981]  _qed_get_vport_stats+0x141/0x240 [qed]
      [ 3673.989001]  qed_get_vport_stats+0x18/0x80 [qed]
      [ 3673.989016]  qede_fill_by_demand_stats+0x37/0x400 [qede]
      [ 3673.989028]  qede_get_stats64+0x19/0xe0 [qede]
      [ 3673.989034]  dev_get_stats+0x5c/0xc0
      [ 3673.989045]  netstat_show.constprop.0+0x52/0xb0
      [ 3673.989055]  dev_attr_show+0x19/0x40
      [ 3673.989065]  sysfs_kf_seq_show+0x9b/0xf0
      [ 3673.989076]  seq_read_iter+0x120/0x4b0
      [ 3673.989087]  new_sync_read+0x118/0x1a0
      [ 3673.989095]  vfs_read+0xf3/0x180
      [ 3673.989099]  ksys_read+0x5f/0xe0
      [ 3673.989102]  do_syscall_64+0x3b/0x90
      [ 3673.989109]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 3673.989115] RIP: 0033:0x7f8467d0b082
      [ 3673.989119] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ca 05 08 00 e8 35 e7 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
      [ 3673.989121] RSP: 002b:00007ffffb21fd08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      [ 3673.989127] RAX: ffffffffffffffda RBX: 000000000100eca0 RCX: 00007f8467d0b082
      [ 3673.989128] RDX: 00000000000003ff RSI: 00007ffffb21fdc0 RDI: 0000000000000003
      [ 3673.989130] RBP: 00007f8467b96028 R08: 0000000000000010 R09: 00007ffffb21ec00
      [ 3673.989132] R10: 00007ffffb27b170 R11: 0000000000000246 R12: 00000000000000f0
      [ 3673.989134] R13: 0000000000000003 R14: 00007f8467b92000 R15: 0000000000045a05
      [ 3673.989139] CPU: 30 PID: 285188 Comm: read_all Kdump: loaded Tainted: G        W  OE
      
      Fix this by collecting the statistics asynchronously from a periodic
      delayed work scheduled at the default stats coalescing interval and
      returning the most recent copy of the statistics from
      .ndo_get_stats64(). Also add the ability to configure/retrieve the
      stats coalescing interval using the commands below:
      
      ethtool -C ethx stats-block-usecs <val>
      ethtool -c ethx
      
      Fixes: 133fac0e ("qede: Add basic ethtool support")
      Cc: Sudarsana Kalluru <skalluru@marvell.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Manish Chopra <manishc@marvell.com>
      Link: https://lore.kernel.org/r/20230605112600.48238-1-manishc@marvell.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      42510dff