1. 22 Feb, 2024 19 commits
  2. 21 Feb, 2024 21 commits
    • Florian Westphal's avatar
      netfilter: nf_tables: use kzalloc for hook allocation · 195e5f88
      Florian Westphal authored
      KMSAN reports unitialized variable when registering the hook,
         reg->hook_ops_type == NF_HOOK_OP_BPF)
              ~~~~~~~~~~~ undefined
      
      This is a small structure, just use kzalloc to make sure this
      won't happen again when new fields get added to nf_hook_ops.
      
      Fixes: 7b4b2fa3 ("netfilter: annotate nf_tables base hook ops")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      195e5f88
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: register hooks last when adding new chain/flowtable · d472e985
      Pablo Neira Ayuso authored
      Register hooks last when adding chain/flowtable to ensure that packets do
      not walk over datastructure that is being released in the error path
      without waiting for the rcu grace period.
      
      Fixes: 91c7b38d ("netfilter: nf_tables: use new transaction infrastructure to handle chain")
      Fixes: 3b49e2e9 ("netfilter: nf_tables: add flow table netlink frontend")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      d472e985
    • Pablo Neira Ayuso's avatar
      netfilter: nft_flow_offload: release dst in case direct xmit path is used · 8762785f
      Pablo Neira Ayuso authored
      Direct xmit does not use it since it calls dev_queue_xmit() to send
      packets, hence it calls dst_release().
      
      kmemleak reports:
      
      unreferenced object 0xffff88814f440900 (size 184):
        comm "softirq", pid 0, jiffies 4294951896
        hex dump (first 32 bytes):
          00 60 5b 04 81 88 ff ff 00 e6 e8 82 ff ff ff ff  .`[.............
          21 0b 50 82 ff ff ff ff 00 00 00 00 00 00 00 00  !.P.............
        backtrace (crc cb2bf5d6):
          [<000000003ee17107>] kmem_cache_alloc+0x286/0x340
          [<0000000021a5de2c>] dst_alloc+0x43/0xb0
          [<00000000f0671159>] rt_dst_alloc+0x2e/0x190
          [<00000000fe5092c9>] __mkroute_output+0x244/0x980
          [<000000005fb96fb0>] ip_route_output_flow+0xc0/0x160
          [<0000000045367433>] nf_ip_route+0xf/0x30
          [<0000000085da1d8e>] nf_route+0x2d/0x60
          [<00000000d1ecd1cb>] nft_flow_route+0x171/0x6a0 [nft_flow_offload]
          [<00000000d9b2fb60>] nft_flow_offload_eval+0x4e8/0x700 [nft_flow_offload]
          [<000000009f447dbb>] expr_call_ops_eval+0x53/0x330 [nf_tables]
          [<00000000072e1be6>] nft_do_chain+0x17c/0x840 [nf_tables]
          [<00000000d0551029>] nft_do_chain_inet+0xa1/0x210 [nf_tables]
          [<0000000097c9d5c6>] nf_hook_slow+0x5b/0x160
          [<0000000005eccab1>] ip_forward+0x8b6/0x9b0
          [<00000000553a269b>] ip_rcv+0x221/0x230
          [<00000000412872e5>] __netif_receive_skb_one_core+0xfe/0x110
      
      Fixes: fa502c86 ("netfilter: flowtable: simplify route logic")
      Reported-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      8762785f
    • Pablo Neira Ayuso's avatar
      netfilter: nft_flow_offload: reset dst in route object after setting up flow · 9e0f0430
      Pablo Neira Ayuso authored
      dst is transferred to the flow object, route object does not own it
      anymore.  Reset dst in route object, otherwise if flow_offload_add()
      fails, error path releases dst twice, leading to a refcount underflow.
      
      Fixes: a3c90f7a ("netfilter: nf_tables: flow offload expression")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      9e0f0430
    • Florian Westphal's avatar
      netfilter: nf_tables: set dormant flag on hook register failure · bccebf64
      Florian Westphal authored
      We need to set the dormant flag again if we fail to register
      the hooks.
      
      During memory pressure hook registration can fail and we end up
      with a table marked as active but no registered hooks.
      
      On table/base chain deletion, nf_tables will attempt to unregister
      the hook again which yields a warn splat from the nftables core.
      
      Reported-and-tested-by: syzbot+de4025c006ec68ac56fc@syzkaller.appspotmail.com
      Fixes: 179d9ba5 ("netfilter: nf_tables: fix table flag updates")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      bccebf64
    • Jakub Kicinski's avatar
      Merge branch 'tls-fixes-for-record-type-handling-with-peek' · f76d5f65
      Jakub Kicinski authored
      Sabrina Dubroca says:
      
      ====================
      tls: fixes for record type handling with PEEK
      
      There are multiple bugs in tls_sw_recvmsg's handling of record types
      when MSG_PEEK flag is used, which can lead to incorrectly merging two
      records:
       - consecutive non-DATA records shouldn't be merged, even if they're
         the same type (partly handled by the test at the end of the main
         loop)
       - records of the same type (even DATA) shouldn't be merged if one
         record of a different type comes in between
      ====================
      
      Link: https://lore.kernel.org/r/cover.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      f76d5f65
    • Sabrina Dubroca's avatar
      selftests: tls: add test for peeking past a record of a different type · 2bf61726
      Sabrina Dubroca authored
      If we queue 3 records:
       - record 1, type DATA
       - record 2, some other type
       - record 3, type DATA
      the current code can look past the 2nd record and merge the 2 data
      records.
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/4623550f8617c239581030c13402d3262f2bd14f.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      2bf61726
    • Sabrina Dubroca's avatar
      selftests: tls: add test for merging of same-type control messages · 7b2a4c2a
      Sabrina Dubroca authored
      Two consecutive control messages of the same type should never be
      merged into one large received blob of data.
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/018f1633d5471684c65def5fe390de3b15c3d683.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      7b2a4c2a
    • Sabrina Dubroca's avatar
      tls: don't skip over different type records from the rx_list · ec823bf3
      Sabrina Dubroca authored
      If we queue 3 records:
       - record 1, type DATA
       - record 2, some other type
       - record 3, type DATA
      and do a recv(PEEK), the rx_list will contain the first two records.
      
      The next large recv will walk through the rx_list and copy data from
      record 1, then stop because record 2 is a different type. Since we
      haven't filled up our buffer, we will process the next available
      record. It's also DATA, so we can merge it with the current read.
      
      We shouldn't do that, since there was a record in between that we
      ignored.
      
      Add a flag to let process_rx_list inform tls_sw_recvmsg that it had
      more data available.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/f00c0c0afa080c60f016df1471158c1caf983c34.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ec823bf3
    • Sabrina Dubroca's avatar
      tls: stop recv() if initial process_rx_list gave us non-DATA · fdfbaec5
      Sabrina Dubroca authored
      If we have a non-DATA record on the rx_list and another record of the
      same type still on the queue, we will end up merging them:
       - process_rx_list copies the non-DATA record
       - we start the loop and process the first available record since it's
         of the same type
       - we break out of the loop since the record was not DATA
      
      Just check the record type and jump to the end in case process_rx_list
      did some work.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/bd31449e43bd4b6ff546f5c51cf958c31c511deb.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      fdfbaec5
    • Sabrina Dubroca's avatar
      tls: break out of main loop when PEEK gets a non-data record · 10f41d07
      Sabrina Dubroca authored
      PEEK needs to leave decrypted records on the rx_list so that we can
      receive them later on, so it jumps back into the async code that
      queues the skb. Unfortunately that makes us skip the
      TLS_RECORD_TYPE_DATA check at the bottom of the main loop, so if two
      records of the same (non-DATA) type are queued, we end up merging
      them.
      
      Add the same record type check, and make it unlikely to not penalize
      the async fastpath. Async decrypt only applies to data record, so this
      check is only needed for PEEK.
      
      process_rx_list also has similar issues.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/3df2eef4fdae720c55e69472b5bea668772b45a2.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      10f41d07
    • Vasiliy Kovalev's avatar
      gtp: fix use-after-free and null-ptr-deref in gtp_genl_dump_pdp() · 136cfaca
      Vasiliy Kovalev authored
      The gtp_net_ops pernet operations structure for the subsystem must be
      registered before registering the generic netlink family.
      
      Syzkaller hit 'general protection fault in gtp_genl_dump_pdp' bug:
      
      general protection fault, probably for non-canonical address
      0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN NOPTI
      KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
      CPU: 1 PID: 5826 Comm: gtp Not tainted 6.8.0-rc3-std-def-alt1 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-alt1 04/01/2014
      RIP: 0010:gtp_genl_dump_pdp+0x1be/0x800 [gtp]
      Code: c6 89 c6 e8 64 e9 86 df 58 45 85 f6 0f 85 4e 04 00 00 e8 c5 ee 86
            df 48 8b 54 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80>
            3c 02 00 0f 85 de 05 00 00 48 8b 44 24 18 4c 8b 30 4c 39 f0 74
      RSP: 0018:ffff888014107220 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff88800fcda588 R14: 0000000000000001 R15: 0000000000000000
      FS:  00007f1be4eb05c0(0000) GS:ffff88806ce80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1be4e766cf CR3: 000000000c33e000 CR4: 0000000000750ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? show_regs+0x90/0xa0
       ? die_addr+0x50/0xd0
       ? exc_general_protection+0x148/0x220
       ? asm_exc_general_protection+0x22/0x30
       ? gtp_genl_dump_pdp+0x1be/0x800 [gtp]
       ? __alloc_skb+0x1dd/0x350
       ? __pfx___alloc_skb+0x10/0x10
       genl_dumpit+0x11d/0x230
       netlink_dump+0x5b9/0xce0
       ? lockdep_hardirqs_on_prepare+0x253/0x430
       ? __pfx_netlink_dump+0x10/0x10
       ? kasan_save_track+0x10/0x40
       ? __kasan_kmalloc+0x9b/0xa0
       ? genl_start+0x675/0x970
       __netlink_dump_start+0x6fc/0x9f0
       genl_family_rcv_msg_dumpit+0x1bb/0x2d0
       ? __pfx_genl_family_rcv_msg_dumpit+0x10/0x10
       ? genl_op_from_small+0x2a/0x440
       ? cap_capable+0x1d0/0x240
       ? __pfx_genl_start+0x10/0x10
       ? __pfx_genl_dumpit+0x10/0x10
       ? __pfx_genl_done+0x10/0x10
       ? security_capable+0x9d/0xe0
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarVasiliy Kovalev <kovalev@altlinux.org>
      Fixes: 459aa660 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
      Link: https://lore.kernel.org/r/20240214162733.34214-1-kovalev@altlinux.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      136cfaca
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 39133352
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "Two fixes for ARM ITS emulation. Unmapped interrupts were used instead
        of ignored, causing NULL pointer dereferences"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: arm64: vgic-its: Test for valid IRQ in MOVALL handler
        KVM: arm64: vgic-its: Test for valid IRQ in its_sync_lpi_pending_table()
      39133352
    • Linus Torvalds's avatar
      Merge tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 8da8d884
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - Fix a deadlock in fiemap.
      
         There was a big lock around the whole operation that can interfere
         with a page fault and mkwrite.
      
         Reducing the lock scope can also speed up fiemap
      
       - Fix range condition for extent defragmentation which could lead to
         worse layout in some cases
      
      * tag 'for-6.8-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix deadlock with fiemap and extent locking
        btrfs: defrag: avoid unnecessary defrag caused by incorrect extent size
      8da8d884
    • Linus Torvalds's avatar
      Merge tag 'v6.8-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · d8be5a55
      Linus Torvalds authored
      Pull crypto fix from Herbert Xu:
       "Fix a stack overflow in virtio"
      
      * tag 'v6.8-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: virtio/akcipher - Fix stack overflow on memcpy
      d8be5a55
    • Shigeru Yoshida's avatar
      bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready() · 4cd12c60
      Shigeru Yoshida authored
      syzbot reported the following NULL pointer dereference issue [1]:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        [...]
        RIP: 0010:0x0
        [...]
        Call Trace:
         <TASK>
         sk_psock_verdict_data_ready+0x232/0x340 net/core/skmsg.c:1230
         unix_stream_sendmsg+0x9b4/0x1230 net/unix/af_unix.c:2293
         sock_sendmsg_nosec net/socket.c:730 [inline]
         __sock_sendmsg+0x221/0x270 net/socket.c:745
         ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
         ___sys_sendmsg net/socket.c:2638 [inline]
         __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
         do_syscall_64+0xf9/0x240
         entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      If sk_psock_verdict_data_ready() and sk_psock_stop_verdict() are called
      concurrently, psock->saved_data_ready can be NULL, causing the above issue.
      
      This patch fixes this issue by calling the appropriate data ready function
      using the sk_psock_data_ready() helper and protecting it from concurrency
      with sk->sk_callback_lock.
      
      Fixes: 6df7f764 ("bpf, sockmap: Wake up polling after data copy")
      Reported-by: syzbot+fd7b34375c1c8ce29c93@syzkaller.appspotmail.com
      Signed-off-by: default avatarShigeru Yoshida <syoshida@redhat.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: syzbot+fd7b34375c1c8ce29c93@syzkaller.appspotmail.com
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=fd7b34375c1c8ce29c93 [1]
      Link: https://lore.kernel.org/bpf/20240218150933.6004-1-syoshida@redhat.com
      4cd12c60
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Do not let subbuf be bigger than write mask · e78fb4ea
      Steven Rostedt (Google) authored
      The data on the subbuffer is measured by a write variable that also
      contains status flags. The counter is just 20 bits in length. If the
      subbuffer is bigger than then counter, it will fail.
      
      Make sure that the subbuffer can not be set to greater than the counter
      that keeps track of the data on the subbuffer.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20240220095112.77e9cb81@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: 2808e31e ("ring-buffer: Add interface for configuring trace sub buffer size")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      e78fb4ea
    • Simon Horman's avatar
      MAINTAINERS: Add framer headers to NETWORKING [GENERAL] · 14dec56f
      Simon Horman authored
      The cited commit [1] added framer support under drivers/net/wan,
      which is covered by NETWORKING [GENERAL]. And it is implied
      that framer-provider.h and framer.h, which were also added
      buy the same patch, are also maintained as part of NETWORKING [GENERAL].
      
      Make this explicit by adding these files to the corresponding
      section in MAINTAINERS.
      
      [1] 82c944d0 ("net: wan: Add framer framework support")
      Signed-off-by: default avatarSimon Horman <horms@kernel.org>
      Reviewed-by: default avatarHerve Codina <herve.codina@bootlin.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      14dec56f
    • Kuniyuki Iwashima's avatar
      af_unix: Drop oob_skb ref before purging queue in GC. · aa82ac51
      Kuniyuki Iwashima authored
      syzbot reported another task hung in __unix_gc().  [0]
      
      The current while loop assumes that all of the left candidates
      have oob_skb and calling kfree_skb(oob_skb) releases the remaining
      candidates.
      
      However, I missed a case that oob_skb has self-referencing fd and
      another fd and the latter sk is placed before the former in the
      candidate list.  Then, the while loop never proceeds, resulting
      the task hung.
      
      __unix_gc() has the same loop just before purging the collected skb,
      so we can call kfree_skb(oob_skb) there and let __skb_queue_purge()
      release all inflight sockets.
      
      [0]:
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 2784 Comm: kworker/u4:8 Not tainted 6.8.0-rc4-syzkaller-01028-g71b605d3 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Workqueue: events_unbound __unix_gc
      RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x70 kernel/kcov.c:200
      Code: 89 fb e8 23 00 00 00 48 8b 3d 84 f5 1a 0c 48 89 de 5b e9 43 26 57 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <f3> 0f 1e fa 48 8b 04 24 65 48 8b 0d 90 52 70 7e 65 8b 15 91 52 70
      RSP: 0018:ffffc9000a17fa78 EFLAGS: 00000287
      RAX: ffffffff8a0a6108 RBX: ffff88802b6c2640 RCX: ffff88802c0b3b80
      RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
      RBP: ffffc9000a17fbf0 R08: ffffffff89383f1d R09: 1ffff1100ee5ff84
      R10: dffffc0000000000 R11: ffffed100ee5ff85 R12: 1ffff110056d84ee
      R13: ffffc9000a17fae0 R14: 0000000000000000 R15: ffffffff8f47b840
      FS:  0000000000000000(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffef5687ff8 CR3: 0000000029b34000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <NMI>
       </NMI>
       <TASK>
       __unix_gc+0xe69/0xf40 net/unix/garbage.c:343
       process_one_work kernel/workqueue.c:2633 [inline]
       process_scheduled_works+0x913/0x1420 kernel/workqueue.c:2706
       worker_thread+0xa5f/0x1000 kernel/workqueue.c:2787
       kthread+0x2ef/0x390 kernel/kthread.c:388
       ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:242
       </TASK>
      
      Reported-and-tested-by: syzbot+ecab4d36f920c3574bf9@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=ecab4d36f920c3574bf9
      Fixes: 25236c91 ("af_unix: Fix task hung while purging oob_skb in GC.")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aa82ac51
    • Alex Elder's avatar
      net: ipa: don't overrun IPA suspend interrupt registers · d80f8e96
      Alex Elder authored
      In newer hardware, IPA supports more than 32 endpoints.  Some
      registers--such as IPA interrupt registers--represent endpoints
      as bits in a 4-byte register, and such registers are repeated as
      needed to represent endpoints beyond the first 32.
      
      In ipa_interrupt_suspend_clear_all(), we clear all pending IPA
      suspend interrupts by reading all status register(s) and writing
      corresponding registers to clear interrupt conditions.
      
      Unfortunately the number of registers to read/write is calculated
      incorrectly, and as a result we access *many* more registers than
      intended.  This bug occurs only when the IPA hardware signals a
      SUSPEND interrupt, which happens when a packet is received for an
      endpoint (or its underlying GSI channel) that is suspended.  This
      situation is difficult to reproduce, but possible.
      
      Fix this by correctly computing the number of interrupt registers to
      read and write.  This is the only place in the code where registers
      that map endpoints or channels this way perform this calculation.
      
      Fixes: f298ba78 ("net: ipa: add a parameter to suspend registers")
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d80f8e96
    • Eric Dumazet's avatar
      net: implement lockless setsockopt(SO_PEEK_OFF) · 56667da7
      Eric Dumazet authored
      syzbot reported a lockdep violation [1] involving af_unix
      support of SO_PEEK_OFF.
      
      Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket
      sk_peek_off field), there is really no point to enforce a pointless
      thread safety in the kernel.
      
      After this patch :
      
      - setsockopt(SO_PEEK_OFF) no longer acquires the socket lock.
      
      - skb_consume_udp() no longer has to acquire the socket lock.
      
      - af_unix no longer needs a special version of sk_set_peek_off(),
        because it does not lock u->iolock anymore.
      
      As a followup, we could replace prot->set_peek_off to be a boolean
      and avoid an indirect call, since we always use sk_set_peek_off().
      
      [1]
      
      WARNING: possible circular locking dependency detected
      6.8.0-rc4-syzkaller-00267-g0f1dd5e9 #0 Not tainted
      
      syz-executor.2/30025 is trying to acquire lock:
       ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
      
      but task is already holding lock:
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}:
              lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
              lock_sock_nested+0x48/0x100 net/core/sock.c:3524
              lock_sock include/net/sock.h:1691 [inline]
              __unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415
              sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046
              ____sys_recvmsg+0x3c0/0x470 net/socket.c:2801
              ___sys_recvmsg net/socket.c:2845 [inline]
              do_recvmmsg+0x474/0xae0 net/socket.c:2939
              __sys_recvmmsg net/socket.c:3018 [inline]
              __do_sys_recvmmsg net/socket.c:3041 [inline]
              __se_sys_recvmmsg net/socket.c:3034 [inline]
              __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
             do_syscall_64+0xf9/0x240
             entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      -> #0 (&u->iolock){+.+.}-{3:3}:
              check_prev_add kernel/locking/lockdep.c:3134 [inline]
              check_prevs_add kernel/locking/lockdep.c:3253 [inline]
              validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
              __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
              lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
              __mutex_lock_common kernel/locking/mutex.c:608 [inline]
              __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
              unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
             sk_setsockopt+0x207e/0x3360
              do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
              __sys_setsockopt+0x1ad/0x250 net/socket.c:2334
              __do_sys_setsockopt net/socket.c:2343 [inline]
              __se_sys_setsockopt net/socket.c:2340 [inline]
              __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
             do_syscall_64+0xf9/0x240
             entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sk_lock-AF_UNIX);
                                     lock(&u->iolock);
                                     lock(sk_lock-AF_UNIX);
        lock(&u->iolock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor.2/30025:
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193
      
      stack backtrace:
      CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
        check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
        check_prev_add kernel/locking/lockdep.c:3134 [inline]
        check_prevs_add kernel/locking/lockdep.c:3253 [inline]
        validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
        __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
        lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
        __mutex_lock_common kernel/locking/mutex.c:608 [inline]
        __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
        unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
       sk_setsockopt+0x207e/0x3360
        do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
        __sys_setsockopt+0x1ad/0x250 net/socket.c:2334
        __do_sys_setsockopt net/socket.c:2343 [inline]
        __se_sys_setsockopt net/socket.c:2340 [inline]
        __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      RIP: 0033:0x7f78a1c7dda9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9
      RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006
      RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000
      R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8
      
      Fixes: 859051dd ("bpf: Implement cgroup sockaddr hooks for unix sockets")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Daan De Meyer <daan.j.demeyer@gmail.com>
      Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
      Cc: Martin KaFai Lau <martin.lau@kernel.org>
      Cc: David Ahern <dsahern@kernel.org>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      56667da7