1. 05 Jul, 2023 17 commits
    • Thadeu Lima de Souza Cascardo's avatar
      netfilter: nf_tables: prevent OOB access in nft_byteorder_eval · caf3ef74
      Thadeu Lima de Souza Cascardo authored
      When evaluating byteorder expressions with size 2, a union with 32-bit and
      16-bit members is used. Since the 16-bit members are aligned to 32-bit,
      the array accesses will be out-of-bounds.
      
      It may lead to a stack-out-of-bounds access like the one below:
      
      [   23.095215] ==================================================================
      [   23.095625] BUG: KASAN: stack-out-of-bounds in nft_byteorder_eval+0x13c/0x320
      [   23.096020] Read of size 2 at addr ffffc90000007948 by task ping/115
      [   23.096358]
      [   23.096456] CPU: 0 PID: 115 Comm: ping Not tainted 6.4.0+ #413
      [   23.096770] Call Trace:
      [   23.096910]  <IRQ>
      [   23.097030]  dump_stack_lvl+0x60/0xc0
      [   23.097218]  print_report+0xcf/0x630
      [   23.097388]  ? nft_byteorder_eval+0x13c/0x320
      [   23.097577]  ? kasan_addr_to_slab+0xd/0xc0
      [   23.097760]  ? nft_byteorder_eval+0x13c/0x320
      [   23.097949]  kasan_report+0xc9/0x110
      [   23.098106]  ? nft_byteorder_eval+0x13c/0x320
      [   23.098298]  __asan_load2+0x83/0xd0
      [   23.098453]  nft_byteorder_eval+0x13c/0x320
      [   23.098659]  nft_do_chain+0x1c8/0xc50
      [   23.098852]  ? __pfx_nft_do_chain+0x10/0x10
      [   23.099078]  ? __kasan_check_read+0x11/0x20
      [   23.099295]  ? __pfx___lock_acquire+0x10/0x10
      [   23.099535]  ? __pfx___lock_acquire+0x10/0x10
      [   23.099745]  ? __kasan_check_read+0x11/0x20
      [   23.099929]  nft_do_chain_ipv4+0xfe/0x140
      [   23.100105]  ? __pfx_nft_do_chain_ipv4+0x10/0x10
      [   23.100327]  ? lock_release+0x204/0x400
      [   23.100515]  ? nf_hook.constprop.0+0x340/0x550
      [   23.100779]  nf_hook_slow+0x6c/0x100
      [   23.100977]  ? __pfx_nft_do_chain_ipv4+0x10/0x10
      [   23.101223]  nf_hook.constprop.0+0x334/0x550
      [   23.101443]  ? __pfx_ip_local_deliver_finish+0x10/0x10
      [   23.101677]  ? __pfx_nf_hook.constprop.0+0x10/0x10
      [   23.101882]  ? __pfx_ip_rcv_finish+0x10/0x10
      [   23.102071]  ? __pfx_ip_local_deliver_finish+0x10/0x10
      [   23.102291]  ? rcu_read_lock_held+0x4b/0x70
      [   23.102481]  ip_local_deliver+0xbb/0x110
      [   23.102665]  ? __pfx_ip_rcv+0x10/0x10
      [   23.102839]  ip_rcv+0x199/0x2a0
      [   23.102980]  ? __pfx_ip_rcv+0x10/0x10
      [   23.103140]  __netif_receive_skb_one_core+0x13e/0x150
      [   23.103362]  ? __pfx___netif_receive_skb_one_core+0x10/0x10
      [   23.103647]  ? mark_held_locks+0x48/0xa0
      [   23.103819]  ? process_backlog+0x36c/0x380
      [   23.103999]  __netif_receive_skb+0x23/0xc0
      [   23.104179]  process_backlog+0x91/0x380
      [   23.104350]  __napi_poll.constprop.0+0x66/0x360
      [   23.104589]  ? net_rx_action+0x1cb/0x610
      [   23.104811]  net_rx_action+0x33e/0x610
      [   23.105024]  ? _raw_spin_unlock+0x23/0x50
      [   23.105257]  ? __pfx_net_rx_action+0x10/0x10
      [   23.105485]  ? mark_held_locks+0x48/0xa0
      [   23.105741]  __do_softirq+0xfa/0x5ab
      [   23.105956]  ? __dev_queue_xmit+0x765/0x1c00
      [   23.106193]  do_softirq.part.0+0x49/0xc0
      [   23.106423]  </IRQ>
      [   23.106547]  <TASK>
      [   23.106670]  __local_bh_enable_ip+0xf5/0x120
      [   23.106903]  __dev_queue_xmit+0x789/0x1c00
      [   23.107131]  ? __pfx___dev_queue_xmit+0x10/0x10
      [   23.107381]  ? find_held_lock+0x8e/0xb0
      [   23.107585]  ? lock_release+0x204/0x400
      [   23.107798]  ? neigh_resolve_output+0x185/0x350
      [   23.108049]  ? mark_held_locks+0x48/0xa0
      [   23.108265]  ? neigh_resolve_output+0x185/0x350
      [   23.108514]  neigh_resolve_output+0x246/0x350
      [   23.108753]  ? neigh_resolve_output+0x246/0x350
      [   23.109003]  ip_finish_output2+0x3c3/0x10b0
      [   23.109250]  ? __pfx_ip_finish_output2+0x10/0x10
      [   23.109510]  ? __pfx_nf_hook+0x10/0x10
      [   23.109732]  __ip_finish_output+0x217/0x390
      [   23.109978]  ip_finish_output+0x2f/0x130
      [   23.110207]  ip_output+0xc9/0x170
      [   23.110404]  ip_push_pending_frames+0x1a0/0x240
      [   23.110652]  raw_sendmsg+0x102e/0x19e0
      [   23.110871]  ? __pfx_raw_sendmsg+0x10/0x10
      [   23.111093]  ? lock_release+0x204/0x400
      [   23.111304]  ? __mod_lruvec_page_state+0x148/0x330
      [   23.111567]  ? find_held_lock+0x8e/0xb0
      [   23.111777]  ? find_held_lock+0x8e/0xb0
      [   23.111993]  ? __rcu_read_unlock+0x7c/0x2f0
      [   23.112225]  ? aa_sk_perm+0x18a/0x550
      [   23.112431]  ? filemap_map_pages+0x4f1/0x900
      [   23.112665]  ? __pfx_aa_sk_perm+0x10/0x10
      [   23.112880]  ? find_held_lock+0x8e/0xb0
      [   23.113098]  inet_sendmsg+0xa0/0xb0
      [   23.113297]  ? inet_sendmsg+0xa0/0xb0
      [   23.113500]  ? __pfx_inet_sendmsg+0x10/0x10
      [   23.113727]  sock_sendmsg+0xf4/0x100
      [   23.113924]  ? move_addr_to_kernel.part.0+0x4f/0xa0
      [   23.114190]  __sys_sendto+0x1d4/0x290
      [   23.114391]  ? __pfx___sys_sendto+0x10/0x10
      [   23.114621]  ? __pfx_mark_lock.part.0+0x10/0x10
      [   23.114869]  ? lock_release+0x204/0x400
      [   23.115076]  ? find_held_lock+0x8e/0xb0
      [   23.115287]  ? rcu_is_watching+0x23/0x60
      [   23.115503]  ? __rseq_handle_notify_resume+0x6e2/0x860
      [   23.115778]  ? __kasan_check_write+0x14/0x30
      [   23.116008]  ? blkcg_maybe_throttle_current+0x8d/0x770
      [   23.116285]  ? mark_held_locks+0x28/0xa0
      [   23.116503]  ? do_syscall_64+0x37/0x90
      [   23.116713]  __x64_sys_sendto+0x7f/0xb0
      [   23.116924]  do_syscall_64+0x59/0x90
      [   23.117123]  ? irqentry_exit_to_user_mode+0x25/0x30
      [   23.117387]  ? irqentry_exit+0x77/0xb0
      [   23.117593]  ? exc_page_fault+0x92/0x140
      [   23.117806]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [   23.118081] RIP: 0033:0x7f744aee2bba
      [   23.118282] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      [   23.119237] RSP: 002b:00007ffd04a7c9f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [   23.119644] RAX: ffffffffffffffda RBX: 00007ffd04a7e0a0 RCX: 00007f744aee2bba
      [   23.120023] RDX: 0000000000000040 RSI: 000056488e9e6300 RDI: 0000000000000003
      [   23.120413] RBP: 000056488e9e6300 R08: 00007ffd04a80320 R09: 0000000000000010
      [   23.120809] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
      [   23.121219] R13: 00007ffd04a7dc38 R14: 00007ffd04a7ca00 R15: 00007ffd04a7e0a0
      [   23.121617]  </TASK>
      [   23.121749]
      [   23.121845] The buggy address belongs to the virtual mapping at
      [   23.121845]  [ffffc90000000000, ffffc90000009000) created by:
      [   23.121845]  irq_init_percpu_irqstack+0x1cf/0x270
      [   23.122707]
      [   23.122803] The buggy address belongs to the physical page:
      [   23.123104] page:0000000072ac19f0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x24a09
      [   23.123609] flags: 0xfffffc0001000(reserved|node=0|zone=1|lastcpupid=0x1fffff)
      [   23.123998] page_type: 0xffffffff()
      [   23.124194] raw: 000fffffc0001000 ffffea0000928248 ffffea0000928248 0000000000000000
      [   23.124610] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
      [   23.125023] page dumped because: kasan: bad access detected
      [   23.125326]
      [   23.125421] Memory state around the buggy address:
      [   23.125682]  ffffc90000007800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [   23.126072]  ffffc90000007880: 00 00 00 00 00 f1 f1 f1 f1 f1 f1 00 00 f2 f2 00
      [   23.126455] >ffffc90000007900: 00 00 00 00 00 00 00 00 00 f2 f2 f2 f2 00 00 00
      [   23.126840]                                               ^
      [   23.127138]  ffffc90000007980: 00 00 00 00 00 00 00 00 00 00 00 00 00 f3 f3 f3
      [   23.127522]  ffffc90000007a00: f3 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
      [   23.127906] ==================================================================
      [   23.128324] Disabling lock debugging due to kernel taint
      
      Using simple s16 pointers for the 16-bit accesses fixes the problem. For
      the 32-bit accesses, src and dst can be used directly.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Cc: stable@vger.kernel.org
      Reported-by: Tanguy DUBROCA (@SidewayRE) from @Synacktiv working with ZDI
      Signed-off-by: default avatarThadeu Lima de Souza Cascardo <cascardo@canonical.com>
      Reviewed-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      caf3ef74
    • Thadeu Lima de Souza Cascardo's avatar
      netfilter: nf_tables: do not ignore genmask when looking up chain by id · 515ad530
      Thadeu Lima de Souza Cascardo authored
      When adding a rule to a chain referring to its ID, if that chain had been
      deleted on the same batch, the rule might end up referring to a deleted
      chain.
      
      This will lead to a WARNING like following:
      
      [   33.098431] ------------[ cut here ]------------
      [   33.098678] WARNING: CPU: 5 PID: 69 at net/netfilter/nf_tables_api.c:2037 nf_tables_chain_destroy+0x23d/0x260
      [   33.099217] Modules linked in:
      [   33.099388] CPU: 5 PID: 69 Comm: kworker/5:1 Not tainted 6.4.0+ #409
      [   33.099726] Workqueue: events nf_tables_trans_destroy_work
      [   33.100018] RIP: 0010:nf_tables_chain_destroy+0x23d/0x260
      [   33.100306] Code: 8b 7c 24 68 e8 64 9c ed fe 4c 89 e7 e8 5c 9c ed fe 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d 31 c0 89 c6 89 c7 c3 cc cc cc cc <0f> 0b 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d 31 c0 89 c6 89 c7
      [   33.101271] RSP: 0018:ffffc900004ffc48 EFLAGS: 00010202
      [   33.101546] RAX: 0000000000000001 RBX: ffff888006fc0a28 RCX: 0000000000000000
      [   33.101920] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      [   33.102649] RBP: ffffc900004ffc78 R08: 0000000000000000 R09: 0000000000000000
      [   33.103018] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880135ef500
      [   33.103385] R13: 0000000000000000 R14: dead000000000122 R15: ffff888006fc0a10
      [   33.103762] FS:  0000000000000000(0000) GS:ffff888024c80000(0000) knlGS:0000000000000000
      [   33.104184] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   33.104493] CR2: 00007fe863b56a50 CR3: 00000000124b0001 CR4: 0000000000770ee0
      [   33.104872] PKRU: 55555554
      [   33.104999] Call Trace:
      [   33.105113]  <TASK>
      [   33.105214]  ? show_regs+0x72/0x90
      [   33.105371]  ? __warn+0xa5/0x210
      [   33.105520]  ? nf_tables_chain_destroy+0x23d/0x260
      [   33.105732]  ? report_bug+0x1f2/0x200
      [   33.105902]  ? handle_bug+0x46/0x90
      [   33.106546]  ? exc_invalid_op+0x19/0x50
      [   33.106762]  ? asm_exc_invalid_op+0x1b/0x20
      [   33.106995]  ? nf_tables_chain_destroy+0x23d/0x260
      [   33.107249]  ? nf_tables_chain_destroy+0x30/0x260
      [   33.107506]  nf_tables_trans_destroy_work+0x669/0x680
      [   33.107782]  ? mark_held_locks+0x28/0xa0
      [   33.107996]  ? __pfx_nf_tables_trans_destroy_work+0x10/0x10
      [   33.108294]  ? _raw_spin_unlock_irq+0x28/0x70
      [   33.108538]  process_one_work+0x68c/0xb70
      [   33.108755]  ? lock_acquire+0x17f/0x420
      [   33.108977]  ? __pfx_process_one_work+0x10/0x10
      [   33.109218]  ? do_raw_spin_lock+0x128/0x1d0
      [   33.109435]  ? _raw_spin_lock_irq+0x71/0x80
      [   33.109634]  worker_thread+0x2bd/0x700
      [   33.109817]  ? __pfx_worker_thread+0x10/0x10
      [   33.110254]  kthread+0x18b/0x1d0
      [   33.110410]  ? __pfx_kthread+0x10/0x10
      [   33.110581]  ret_from_fork+0x29/0x50
      [   33.110757]  </TASK>
      [   33.110866] irq event stamp: 1651
      [   33.111017] hardirqs last  enabled at (1659): [<ffffffffa206a209>] __up_console_sem+0x79/0xa0
      [   33.111379] hardirqs last disabled at (1666): [<ffffffffa206a1ee>] __up_console_sem+0x5e/0xa0
      [   33.111740] softirqs last  enabled at (1616): [<ffffffffa1f5d40e>] __irq_exit_rcu+0x9e/0xe0
      [   33.112094] softirqs last disabled at (1367): [<ffffffffa1f5d40e>] __irq_exit_rcu+0x9e/0xe0
      [   33.112453] ---[ end trace 0000000000000000 ]---
      
      This is due to the nft_chain_lookup_byid ignoring the genmask. After this
      change, adding the new rule will fail as it will not find the chain.
      
      Fixes: 837830a4 ("netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute")
      Cc: stable@vger.kernel.org
      Reported-by: Mingi Cho of Theori working with ZDI
      Signed-off-by: default avatarThadeu Lima de Souza Cascardo <cascardo@canonical.com>
      Reviewed-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      515ad530
    • Florian Westphal's avatar
      netfilter: conntrack: don't fold port numbers into addresses before hashing · eaf9e719
      Florian Westphal authored
      Originally this used jhash2() over tuple and folded the zone id,
      the pernet hash value, destination port and l4 protocol number into the
      32bit seed value.
      
      When the switch to siphash was done, I used an on-stack temporary
      buffer to build a suitable key to be hashed via siphash().
      
      But this showed up as performance regression, so I got rid of
      the temporary copy and collected to-be-hashed data in 4 u64 variables.
      
      This makes it easy to build tuples that produce the same hash, which isn't
      desirable even though chain lengths are limited.
      
      Switch back to plain siphash, but just like with jhash2(), take advantage
      of the fact that most of to-be-hashed data is already in a suitable order.
      
      Use an empty struct as annotation in 'struct nf_conntrack_tuple' to mark
      last member that can be used as hash input.
      
      The only remaining data that isn't present in the tuple structure are the
      zone identifier and the pernet hash: fold those into the key.
      
      Fixes: d2c806ab ("netfilter: conntrack: use siphash_4u64")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      eaf9e719
    • Florent Revest's avatar
      netfilter: conntrack: Avoid nf_ct_helper_hash uses after free · 6eef7a2b
      Florent Revest authored
      If nf_conntrack_init_start() fails (for example due to a
      register_nf_conntrack_bpf() failure), the nf_conntrack_helper_fini()
      clean-up path frees the nf_ct_helper_hash map.
      
      When built with NF_CONNTRACK=y, further netfilter modules (e.g:
      netfilter_conntrack_ftp) can still be loaded and call
      nf_conntrack_helpers_register(), independently of whether nf_conntrack
      initialized correctly. This accesses the nf_ct_helper_hash dangling
      pointer and causes a uaf, possibly leading to random memory corruption.
      
      This patch guards nf_conntrack_helper_register() from accessing a freed
      or uninitialized nf_ct_helper_hash pointer and fixes possible
      uses-after-free when loading a conntrack module.
      
      Cc: stable@vger.kernel.org
      Fixes: 12f7a505 ("netfilter: add user-space connection tracking helper infrastructure")
      Signed-off-by: default avatarFlorent Revest <revest@chromium.org>
      Reviewed-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      6eef7a2b
    • Florian Westphal's avatar
      netfilter: conntrack: gre: don't set assured flag for clash entries · 8a9dc07b
      Florian Westphal authored
      Now that conntrack core is allowd to insert clashing entries, make sure
      GRE won't set assured flag on NAT_CLASH entries, just like UDP.
      
      Doing so prevents early_drop logic for these entries.
      
      Fixes: d671fd82 ("netfilter: conntrack: allow insertion clash of gre protocol")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      8a9dc07b
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: report use refcount overflow · 1689f259
      Pablo Neira Ayuso authored
      Overflow use refcount checks are not complete.
      
      Add helper function to deal with object reference counter tracking.
      Report -EMFILE in case UINT_MAX is reached.
      
      nft_use_dec() splats in case that reference counter underflows,
      which should not ever happen.
      
      Add nft_use_inc_restore() and nft_use_dec_restore() which are used
      to restore reference counter from error and abort paths.
      
      Use u32 in nft_flowtable and nft_object since helper functions cannot
      work on bitfields.
      
      Remove the few early incomplete checks now that the helper functions
      are in place and used to check for refcount overflow.
      
      Fixes: 96518518 ("netfilter: add nftables")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      1689f259
    • David S. Miller's avatar
      Merge branch 'mptcp-fixes' · c451410c
      David S. Miller authored
      Matthieu Baerts says:
      
      ====================
      mptcp: fixes for v6.5
      
      Here is a first batch of fixes for v6.5 and older.
      
      The fixes are not linked to each others.
      
      Patch 1 ensures subflows are unhashed before cleaning the backlog to
      avoid races. This fixes another recent fix from v6.4.
      
      Patch 2 does not rely on implicit state check in mptcp_listen() to avoid
      races when receiving an MP_FASTCLOSE. A regression from v5.17.
      
      The rest fixes issues in the selftests.
      
      Patch 3 makes sure errors when setting up the environment are no longer
      ignored. For v5.17+.
      
      Patch 4 uses 'iptables-legacy' if available to be able to run on older
      kernels. A fix for v5.13 and newer.
      
      Patch 5 catches errors when issues are detected with packet marks. Also
      for v5.13+.
      
      Patch 6 uses the correct variable instead of an undefined one. Even if
      there was no visible impact, it can help to find regressions later. An
      issue visible in v5.19+.
      
      Patch 7 makes sure errors with some sub-tests are reported to have the
      selftest marked as failed as expected. Also for v5.19+.
      
      Patch 8 adds a kernel config that is required to execute MPTCP
      selftests. It is valid for v5.9+.
      
      Patch 9 fixes issues when validating the userspace path-manager with
      32-bit arch, an issue affecting v5.19+.
      ====================
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      c451410c
    • Matthieu Baerts's avatar
      selftests: mptcp: pm_nl_ctl: fix 32-bit support · 61d96580
      Matthieu Baerts authored
      When using pm_nl_ctl to validate userspace path-manager's behaviours, it
      was failing on 32-bit architectures ~half of the time.
      
      pm_nl_ctl was not reporting any error but the command was not doing what
      it was expected to do. As a result, the expected linked event was not
      triggered after and the test failed.
      
      This is due to the fact the token given in argument to the application
      was parsed as an integer with atoi(): in a 32-bit arch, if the number
      was bigger than INT_MAX, 2147483647 was used instead.
      
      This can simply be fixed by using strtoul() instead of atoi().
      
      The errors have been seen "by chance" when manually looking at the
      results from LKFT.
      
      Fixes: 9a0b3650 ("selftests: mptcp: support MPTCP_PM_CMD_ANNOUNCE")
      Cc: stable@vger.kernel.org
      Fixes: ecd2a77d ("selftests: mptcp: support MPTCP_PM_CMD_REMOVE")
      Fixes: cf8d0a6d ("selftests: mptcp: support MPTCP_PM_CMD_SUBFLOW_CREATE")
      Fixes: 57cc361b ("selftests: mptcp: support MPTCP_PM_CMD_SUBFLOW_DESTROY")
      Fixes: ca188a25 ("selftests: mptcp: userspace PM support for MP_PRIO signals")
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      61d96580
    • Matthieu Baerts's avatar
      selftests: mptcp: depend on SYN_COOKIES · 6c8880fc
      Matthieu Baerts authored
      MPTCP selftests are using TCP SYN Cookies for quite a while now, since
      v5.9.
      
      Some CIs don't have this config option enabled and this is causing
      issues in the tests:
      
        # ns1 MPTCP -> ns1 (10.0.1.1:10000      ) MPTCP     (duration   167ms) sysctl: cannot stat /proc/sys/net/ipv4/tcp_syncookies: No such file or directory
        # [ OK ]./mptcp_connect.sh: line 554: [: -eq: unary operator expected
      
      There is no impact in the results but the test is not doing what it is
      supposed to do.
      
      Fixes: fed61c4b ("selftests: mptcp: make 2nd net namespace use tcp syn cookies unconditionally")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6c8880fc
    • Matthieu Baerts's avatar
      selftests: mptcp: userspace_pm: report errors with 'remove' tests · 966c6c3a
      Matthieu Baerts authored
      A message was mentioning an issue with the "remove" tests but the
      selftest was not marked as failed.
      
      Directly exit with an error like it is done everywhere else in this
      selftest.
      
      Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
      Fixes: 259a834f ("selftests: mptcp: functional tests for the userspace PM type")
      Cc: stable@vger.kernel.org
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      966c6c3a
    • Matthieu Baerts's avatar
      selftests: mptcp: userspace_pm: use correct server port · d8566d0e
      Matthieu Baerts authored
      "server4_port" variable is not set but "app4_port" is the server port in
      v4 and the correct variable name to use.
      
      The port is optional so there was no visible impact.
      
      Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
      Fixes: ca188a25 ("selftests: mptcp: userspace PM support for MP_PRIO signals")
      Cc: stable@vger.kernel.org
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d8566d0e
    • Matthieu Baerts's avatar
      selftests: mptcp: sockopt: return error if wrong mark · 9ac4c28e
      Matthieu Baerts authored
      When an error was detected when checking the marks, a message was
      correctly printed mentioning the error but followed by another one
      saying everything was OK and the selftest was not marked as failed as
      expected.
      
      Now the 'ret' variable is directly set to 1 in order to make sure the
      exit is done with an error, similar to what is done in other functions.
      While at it, the error is correctly propagated to the caller.
      
      Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
      Fixes: dc65fe82 ("selftests: mptcp: add packet mark test case")
      Cc: stable@vger.kernel.org
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9ac4c28e
    • Matthieu Baerts's avatar
      selftests: mptcp: sockopt: use 'iptables-legacy' if available · a5a5990c
      Matthieu Baerts authored
      IPTables commands using 'iptables-nft' fail on old kernels, at least
      on v5.15 because it doesn't see the default IPTables chains:
      
        $ iptables -L
        iptables/1.8.2 Failed to initialize nft: Protocol not supported
      
      As a first step before switching to NFTables, we can use iptables-legacy
      if available.
      
      Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
      Fixes: dc65fe82 ("selftests: mptcp: add packet mark test case")
      Cc: stable@vger.kernel.org
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a5a5990c
    • Matthieu Baerts's avatar
      selftests: mptcp: connect: fail if nft supposed to work · 221e4550
      Matthieu Baerts authored
      In case of "external" errors when preparing the environment for the
      TProxy tests, the subtests were marked as skipped.
      
      This is fine but it means these errors are ignored. On MPTCP Public CI,
      we do want to catch such issues and mark the selftest as failed if there
      are such issues. We can then use mptcp_lib_fail_if_expected_feature()
      helper that has been recently added to fail if needed.
      
      Link: https://github.com/multipath-tcp/mptcp_net-next/issues/368
      Fixes: 5fb62e9c ("selftests: mptcp: add tproxy test case")
      Cc: stable@vger.kernel.org
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      221e4550
    • Paolo Abeni's avatar
      mptcp: do not rely on implicit state check in mptcp_listen() · 0226436a
      Paolo Abeni authored
      Since the blamed commit, closing the first subflow resets the first
      subflow socket state to SS_UNCONNECTED.
      
      The current mptcp listen implementation relies only on such
      state to prevent touching not-fully-disconnected sockets.
      
      Incoming mptcp fastclose (or paired endpoint removal) unconditionally
      closes the first subflow.
      
      All the above allows an incoming fastclose followed by a listen() call
      to successfully race with a blocking recvmsg(), potentially causing the
      latter to hit a divide by zero bug in cleanup_rbuf/__tcp_select_window().
      
      Address the issue explicitly checking the msk socket state in
      mptcp_listen(). An alternative solution would be moving the first
      subflow socket state update into mptcp_disconnect(), but in the long
      term the first subflow socket should be removed: better avoid relaying
      on it for internal consistency check.
      
      Fixes: b29fcfb5 ("mptcp: full disconnect implementation")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarChristoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/414Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0226436a
    • Paolo Abeni's avatar
      mptcp: ensure subflow is unhashed before cleaning the backlog · 3fffa15b
      Paolo Abeni authored
      While tacking care of the mptcp-level listener I unintentionally
      moved the subflow level unhash after the subflow listener backlog
      cleanup.
      
      That could cause some nasty race and makes the code harder to read.
      
      Address the issue restoring the proper order of operations.
      
      Fixes: 57fc0f1c ("mptcp: ensure listener is unhashed before updating the sk status")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3fffa15b
    • Thorsten Winkler's avatar
      s390/qeth: Fix vipa deletion · 80de809b
      Thorsten Winkler authored
      Change boolean parameter of function "qeth_l3_vipa_store" inside the
      "qeth_l3_dev_vipa_del4_store" function from "true" to "false" because
      "true" is used for adding a virtual ip address and "false" for deleting.
      
      Fixes: 2390166a ("s390/qeth: clean up L3 sysfs code")
      Reviewed-by: default avatarAlexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: default avatarWenjia Zhang <wenjia@linux.ibm.com>
      Signed-off-by: default avatarThorsten Winkler <twinkler@linux.ibm.com>
      Signed-off-by: default avatarAlexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      80de809b
  2. 04 Jul, 2023 8 commits
  3. 03 Jul, 2023 12 commits
    • Rahul Rameshbabu's avatar
      ptp: Make max_phase_adjustment sysfs device attribute invisible when not supported · 2c5d234d
      Rahul Rameshbabu authored
      The .adjphase operation is an operation that is implemented only by certain
      PHCs. The sysfs device attribute node for querying the maximum phase
      adjustment supported should not be exposed on devices that do not support
      .adjphase.
      
      Fixes: c3b60ab7 ("ptp: Add .getmaxphase callback to ptp_clock_info")
      Signed-off-by: default avatarRahul Rameshbabu <rrameshbabu@nvidia.com>
      Reported-by: default avatarNathan Chancellor <nathan@kernel.org>
      Reported-by: default avatarNaresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: default avatarLinux Kernel Functional Testing <lkft@linaro.org>
      Link: https://lore.kernel.org/netdev/20230627162146.GA114473@dev-arch.thelio-3990X/
      Link: https://lore.kernel.org/all/CA+G9fYtKCZeAUTtwe69iK8Xcz1mOKQzwcy49wd+imZrfj6ifXA@mail.gmail.com/Tested-by: default avatarNathan Chancellor <nathan@kernel.org>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Acked-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Reviewed-by: default avatarPetr Vorel <pvorel@suse.cz>
      Message-ID: <20230627232139.213130-1-rrameshbabu@nvidia.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      2c5d234d
    • Subash Abhinov Kasiviswanathan's avatar
      Documentation: ABI: sysfs-class-net-qmi: pass_through contact update · acd97558
      Subash Abhinov Kasiviswanathan authored
      Switch to the quicinc.com id.
      
      Fixes: bd1af6b5 ("Documentation: ABI: sysfs-class-net-qmi: document pass-through file")
      Signed-off-by: default avatarSubash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      acd97558
    • Eric Dumazet's avatar
      tcp: annotate data races in __tcp_oow_rate_limited() · 998127cd
      Eric Dumazet authored
      request sockets are lockless, __tcp_oow_rate_limited() could be called
      on the same object from different cpus. This is harmless.
      
      Add READ_ONCE()/WRITE_ONCE() annotations to avoid a KCSAN report.
      
      Fixes: 4ce7e93c ("tcp: rate limit ACK sent by SYN_RECV request sockets")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      998127cd
    • David S. Miller's avatar
      Merge branch 'wireguard-fixes' · c94683ed
      David S. Miller authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 6.4.2/6.5-rc1
      
      Sorry to send these patches during the merge window, but they're net
      fixes, not netdev enhancements, and while I'd ordinarily wait anyway,
      I just got a first bug report for one of these fixes, which I originally
      had thought was mostly unlikely. So please apply the following three
      patches to net:
      
      1) Make proper use of nr_cpu_ids with cpumask_next(), rather than
         awkwardly using modulo, to handle dynamic CPU topology changes.
         Linus noticed this a while ago and pointed it out, and today a user
         actually got hit by it.
      
      2) Respect persistent keepalive and other staged packets when setting
         the private key after the interface is already up.
      
      3) Use timer_delete_sync() instead of del_timer_sync(), per the
         documentation.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c94683ed
    • Jason A. Donenfeld's avatar
      wireguard: timers: move to using timer_delete_sync · 326534e8
      Jason A. Donenfeld authored
      The documentation says that del_timer_sync is obsolete, and code should
      use the equivalent timer_delete_sync instead, so switch to it.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      326534e8
    • Jason A. Donenfeld's avatar
      wireguard: netlink: send staged packets when setting initial private key · f58d0a9b
      Jason A. Donenfeld authored
      Packets bound for peers can queue up prior to the device private key
      being set. For example, if persistent keepalive is set, a packet is
      queued up to be sent as soon as the device comes up. However, if the
      private key hasn't been set yet, the handshake message never sends, and
      no timer is armed to retry, since that would be pointless.
      
      But, if a user later sets a private key, the expectation is that those
      queued packets, such as a persistent keepalive, are actually sent. So
      adjust the configuration logic to account for this edge case, and add a
      test case to make sure this works.
      
      Maxim noticed this with a wg-quick(8) config to the tune of:
      
          [Interface]
          PostUp = wg set %i private-key somefile
      
          [Peer]
          PublicKey = ...
          Endpoint = ...
          PersistentKeepalive = 25
      
      Here, the private key gets set after the device comes up using a PostUp
      script, triggering the bug.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarMaxim Cournoyer <maxim.cournoyer@gmail.com>
      Tested-by: default avatarMaxim Cournoyer <maxim.cournoyer@gmail.com>
      Link: https://lore.kernel.org/wireguard/87fs7xtqrv.fsf@gmail.com/Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f58d0a9b
    • Jason A. Donenfeld's avatar
      wireguard: queueing: use saner cpu selection wrapping · 7387943f
      Jason A. Donenfeld authored
      Using `% nr_cpumask_bits` is slow and complicated, and not totally
      robust toward dynamic changes to CPU topologies. Rather than storing the
      next CPU in the round-robin, just store the last one, and also return
      that value. This simplifies the loop drastically into a much more common
      pattern.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Tested-by: default avatarManuel Leiner <manuel.leiner@gmx.de>
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7387943f
    • J.J. Martzki's avatar
      samples: pktgen: fix append mode failed issue · a27ac539
      J.J. Martzki authored
      Each sample script sources functions.sh before parameters.sh
      which makes $APPEND undefined when trapping EXIT no matter in
      append mode or not. Due to this when sample scripts finished
      they always do "pgctrl reset" which resets pktgen config.
      
      So move trap to each script after sourcing parameters.sh
      and trap EXIT explicitly.
      Signed-off-by: default avatarJ.J. Martzki <mars14850@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a27ac539
    • Daniel Díaz's avatar
      selftests/net: Add xt_policy config for xfrm_policy test · f56d1eea
      Daniel Díaz authored
      When running Kselftests with the current selftests/net/config
      the following problem can be seen with the net:xfrm_policy.sh
      selftest:
      
        # selftests: net: xfrm_policy.sh
        [   41.076721] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
        [   41.094787] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
        [   41.107635] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
        # modprobe: FATAL: Module ip_tables not found in directory /lib/modules/6.1.36
        # iptables v1.8.7 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
        # Perhaps iptables or your kernel needs to be upgraded.
        # modprobe: FATAL: Module ip_tables not found in directory /lib/modules/6.1.36
        # iptables v1.8.7 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
        # Perhaps iptables or your kernel needs to be upgraded.
        # SKIP: Could not insert iptables rule
        ok 1 selftests: net: xfrm_policy.sh # SKIP
      
      This is because IPsec "policy" match support is not available
      to the kernel.
      
      This patch adds CONFIG_NETFILTER_XT_MATCH_POLICY as a module
      to the selftests/net/config file, so that `make
      kselftest-merge` can take this into consideration.
      Signed-off-by: default avatarDaniel Díaz <daniel.diaz@linaro.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f56d1eea
    • Eric Dumazet's avatar
      net: fix net_dev_start_xmit trace event vs skb_transport_offset() · f88fcb1d
      Eric Dumazet authored
      After blamed commit, we must be more careful about using
      skb_transport_offset(), as reminded us by syzbot:
      
      WARNING: CPU: 0 PID: 10 at include/linux/skbuff.h:2868 skb_transport_offset include/linux/skbuff.h:2977 [inline]
      WARNING: CPU: 0 PID: 10 at include/linux/skbuff.h:2868 perf_trace_net_dev_start_xmit+0x89a/0xce0 include/trace/events/net.h:14
      Modules linked in:
      CPU: 0 PID: 10 Comm: kworker/u4:1 Not tainted 6.1.30-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
      Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
      RIP: 0010:skb_transport_header include/linux/skbuff.h:2868 [inline]
      RIP: 0010:skb_transport_offset include/linux/skbuff.h:2977 [inline]
      RIP: 0010:perf_trace_net_dev_start_xmit+0x89a/0xce0 include/trace/events/net.h:14
      Code: 8b 04 25 28 00 00 00 48 3b 84 24 c0 00 00 00 0f 85 4e 04 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc e8 56 22 01 fd <0f> 0b e9 f6 fc ff ff 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 86 f9 ff
      RSP: 0018:ffffc900002bf700 EFLAGS: 00010293
      RAX: ffffffff8485d8ca RBX: 000000000000ffff RCX: ffff888100914280
      RDX: 0000000000000000 RSI: 000000000000ffff RDI: 000000000000ffff
      RBP: ffffc900002bf818 R08: ffffffff8485d5b6 R09: fffffbfff0f8fb5e
      R10: 0000000000000000 R11: dffffc0000000001 R12: 1ffff110217d8f67
      R13: ffff88810bec7b3a R14: dffffc0000000000 R15: dffffc0000000000
      FS: 0000000000000000(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f96cf6d52f0 CR3: 000000012224c000 CR4: 0000000000350ef0
      Call Trace:
      <TASK>
      [<ffffffff84715e35>] trace_net_dev_start_xmit include/trace/events/net.h:14 [inline]
      [<ffffffff84715e35>] xmit_one net/core/dev.c:3643 [inline]
      [<ffffffff84715e35>] dev_hard_start_xmit+0x705/0x980 net/core/dev.c:3660
      [<ffffffff8471a232>] __dev_queue_xmit+0x16b2/0x3370 net/core/dev.c:4324
      [<ffffffff85416493>] dev_queue_xmit include/linux/netdevice.h:3030 [inline]
      [<ffffffff85416493>] batadv_send_skb_packet+0x3f3/0x680 net/batman-adv/send.c:108
      [<ffffffff85416744>] batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
      [<ffffffff853bc52a>] batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline]
      [<ffffffff853bc52a>] batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline]
      [<ffffffff853bc52a>] batadv_iv_send_outstanding_bat_ogm_packet+0x69a/0x840 net/batman-adv/bat_iv_ogm.c:1701
      [<ffffffff8151023c>] process_one_work+0x8ac/0x1170 kernel/workqueue.c:2289
      [<ffffffff81511938>] worker_thread+0xaa8/0x12d0 kernel/workqueue.c:2436
      
      Fixes: 66e4c8d9 ("net: warn if transport header was not set")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f88fcb1d
    • Vladimir Oltean's avatar
      net: dsa: tag_sja1105: fix source port decoding in vlan_filtering=0 bridge mode · a398b9ea
      Vladimir Oltean authored
      There was a regression introduced by the blamed commit, where pinging to
      a VLAN-unaware bridge would fail with the repeated message "Couldn't
      decode source port" coming from the tagging protocol driver.
      
      When receiving packets with a bridge_vid as determined by
      dsa_tag_8021q_bridge_join(), dsa_8021q_rcv() will decode:
      - source_port = 0 (which isn't really valid, more like "don't know")
      - switch_id = 0 (which isn't really valid, more like "don't know")
      - vbid = value in range 1-7
      
      Since the blamed patch has reversed the order of the checks, we are now
      going to believe that source_port != -1 and switch_id != -1, so they're
      valid, but they aren't.
      
      The minimal solution to the problem is to only populate source_port and
      switch_id with what dsa_8021q_rcv() came up with, if the vbid is zero,
      i.e. the source port information is trustworthy.
      
      Fixes: c1ae02d8 ("net: dsa: tag_sja1105: always prefer source port information from INCL_SRCPT")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a398b9ea
    • Vladimir Oltean's avatar
      net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC mode · 6ca3c005
      Vladimir Oltean authored
      According to the synchronization rules for .ndo_get_stats() as seen in
      Documentation/networking/netdevices.rst, acquiring a plain spin_lock()
      should not be illegal, but the bridge driver implementation makes it so.
      
      After running these commands, I am being faced with the following
      lockdep splat:
      
      $ ip link add link swp0 name macsec0 type macsec encrypt on && ip link set swp0 up
      $ ip link add dev br0 type bridge vlan_filtering 1 && ip link set br0 up
      $ ip link set macsec0 master br0 && ip link set macsec0 up
      
        ========================================================
        WARNING: possible irq lock inversion dependency detected
        6.4.0-04295-g31b577b4bd4a #603 Not tainted
        --------------------------------------------------------
        swapper/1/0 just changed the state of lock:
        ffff6bd348724cd8 (&br->lock){+.-.}-{3:3}, at: br_forward_delay_timer_expired+0x34/0x198
        but this lock took another, SOFTIRQ-unsafe lock in the past:
         (&ocelot->stats_lock){+.+.}-{3:3}
      
        and interrupts could create inverse lock ordering between them.
      
        other info that might help us debug this:
        Chain exists of:
          &br->lock --> &br->hash_lock --> &ocelot->stats_lock
      
         Possible interrupt unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&ocelot->stats_lock);
                                       local_irq_disable();
                                       lock(&br->lock);
                                       lock(&br->hash_lock);
          <Interrupt>
            lock(&br->lock);
      
         *** DEADLOCK ***
      
      (details about the 3 locks skipped)
      
      swp0 is instantiated by drivers/net/dsa/ocelot/felix.c, and this
      only matters to the extent that its .ndo_get_stats64() method calls
      spin_lock(&ocelot->stats_lock).
      
      Documentation/locking/lockdep-design.rst says:
      
      | A lock is irq-safe means it was ever used in an irq context, while a lock
      | is irq-unsafe means it was ever acquired with irq enabled.
      
      (...)
      
      | Furthermore, the following usage based lock dependencies are not allowed
      | between any two lock-classes::
      |
      |    <hardirq-safe>   ->  <hardirq-unsafe>
      |    <softirq-safe>   ->  <softirq-unsafe>
      
      Lockdep marks br->hash_lock as softirq-safe, because it is sometimes
      taken in softirq context (for example br_fdb_update() which runs in
      NET_RX softirq), and when it's not in softirq context it blocks softirqs
      by using spin_lock_bh().
      
      Lockdep marks ocelot->stats_lock as softirq-unsafe, because it never
      blocks softirqs from running, and it is never taken from softirq
      context. So it can always be interrupted by softirqs.
      
      There is a call path through which a function that holds br->hash_lock:
      fdb_add_hw_addr() will call a function that acquires ocelot->stats_lock:
      ocelot_port_get_stats64(). This can be seen below:
      
      ocelot_port_get_stats64+0x3c/0x1e0
      felix_get_stats64+0x20/0x38
      dsa_slave_get_stats64+0x3c/0x60
      dev_get_stats+0x74/0x2c8
      rtnl_fill_stats+0x4c/0x150
      rtnl_fill_ifinfo+0x5cc/0x7b8
      rtmsg_ifinfo_build_skb+0xe4/0x150
      rtmsg_ifinfo+0x5c/0xb0
      __dev_notify_flags+0x58/0x200
      __dev_set_promiscuity+0xa0/0x1f8
      dev_set_promiscuity+0x30/0x70
      macsec_dev_change_rx_flags+0x68/0x88
      __dev_set_promiscuity+0x1a8/0x1f8
      __dev_set_rx_mode+0x74/0xa8
      dev_uc_add+0x74/0xa0
      fdb_add_hw_addr+0x68/0xd8
      fdb_add_local+0xc4/0x110
      br_fdb_add_local+0x54/0x88
      br_add_if+0x338/0x4a0
      br_add_slave+0x20/0x38
      do_setlink+0x3a4/0xcb8
      rtnl_newlink+0x758/0x9d0
      rtnetlink_rcv_msg+0x2f0/0x550
      netlink_rcv_skb+0x128/0x148
      rtnetlink_rcv+0x24/0x38
      
      the plain English explanation for it is:
      
      The macsec0 bridge port is created without p->flags & BR_PROMISC,
      because it is what br_manage_promisc() decides for a VLAN filtering
      bridge with a single auto port.
      
      As part of the br_add_if() procedure, br_fdb_add_local() is called for
      the MAC address of the device, and this results in a call to
      dev_uc_add() for macsec0 while the softirq-safe br->hash_lock is taken.
      
      Because macsec0 does not have IFF_UNICAST_FLT, dev_uc_add() ends up
      calling __dev_set_promiscuity() for macsec0, which is propagated by its
      implementation, macsec_dev_change_rx_flags(), to the lower device: swp0.
      This triggers the call path:
      
      dev_set_promiscuity(swp0)
      -> rtmsg_ifinfo()
         -> dev_get_stats()
            -> ocelot_port_get_stats64()
      
      with a calling context that lockdep doesn't like (br->hash_lock held).
      
      Normally we don't see this, because even though many drivers that can be
      bridge ports don't support IFF_UNICAST_FLT, we need a driver that
      
      (a) doesn't support IFF_UNICAST_FLT, *and*
      (b) it forwards the IFF_PROMISC flag to another driver, and
      (c) *that* driver implements ndo_get_stats64() using a softirq-unsafe
          spinlock.
      
      Condition (b) is necessary because the first __dev_set_rx_mode() calls
      __dev_set_promiscuity() with "bool notify=false", and thus, the
      rtmsg_ifinfo() code path won't be entered.
      
      The same criteria also hold true for DSA switches which don't report
      IFF_UNICAST_FLT. When the DSA master uses a spin_lock() in its
      ndo_get_stats64() method, the same lockdep splat can be seen.
      
      I think the deadlock possibility is real, even though I didn't reproduce
      it, and I'm thinking of the following situation to support that claim:
      
      fdb_add_hw_addr() runs on a CPU A, in a context with softirqs locally
      disabled and br->hash_lock held, and may end up attempting to acquire
      ocelot->stats_lock.
      
      In parallel, ocelot->stats_lock is currently held by a thread B (say,
      ocelot_check_stats_work()), which is interrupted while holding it by a
      softirq which attempts to lock br->hash_lock.
      
      Thread B cannot make progress because br->hash_lock is held by A. Whereas
      thread A cannot make progress because ocelot->stats_lock is held by B.
      
      When taking the issue at face value, the bridge can avoid that problem
      by simply making the ports promiscuous from a code path with a saner
      calling context (br->hash_lock not held). A bridge port without
      IFF_UNICAST_FLT is going to become promiscuous as soon as we call
      dev_uc_add() on it (which we do unconditionally), so why not be
      preemptive and make it promiscuous right from the beginning, so as to
      not be taken by surprise.
      
      With this, we've broken the links between code that holds br->hash_lock
      or br->lock and code that calls into the ndo_change_rx_flags() or
      ndo_get_stats64() ops of the bridge port.
      
      Fixes: 2796d0c6 ("bridge: Automatically manage port promiscuous mode.")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6ca3c005
  4. 02 Jul, 2023 3 commits
    • David S. Miller's avatar
      Merge branch 'octeontx2-af-fixes' · 97791d3c
      David S. Miller authored
      Hariprasad Kelam says:
      
      ====================
      octeontx2-af: MAC block fixes for CN10KB
      
      This patch set contains fixes for the issues encountered in testing
      CN10KB MAC block RPM_USX.
      
      Patch1: firmware to kernel communication is not working due to wrong
              interrupt configuration. CSR addresses are corrected.
      
      Patch2: NIX to RVU PF mapping errors encountered due to wrong firmware
              config. Corrects this mapping error.
      
      Patch3: Driver is trying to access non exist cgx/lmac which is resulting
              in kernel panic. Address this issue by adding proper checks.
      
      Patch4: MAC features are not getting reset on FLR. Fix the issue by
              resetting the stale config.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      97791d3c
    • Hariprasad Kelam's avatar
      octeontx2-af: Reset MAC features in FLR · 2e3e94c2
      Hariprasad Kelam authored
      AF driver configures MAC features like internal loopback and PFC upon
      receiving the request from PF and its VF netdev. But these
      features are not getting reset in FLR.  This patch fixes the issue by
      resetting the same.
      
      Fixes: 23999b30 ("octeontx2-af: Enable or disable CGX internal loopback")
      Fixes: 1121f6b0 ("octeontx2-af: Priority flow control configuration support")
      Signed-off-by: default avatarHariprasad Kelam <hkelam@marvell.com>
      Signed-off-by: default avatarSunil Goutham <sgoutham@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2e3e94c2
    • Hariprasad Kelam's avatar
      octeontx2-af: Add validation before accessing cgx and lmac · 79ebb537
      Hariprasad Kelam authored
      with the addition of new MAC blocks like CN10K RPM and CN10KB
      RPM_USX, LMACs are noncontiguous and CGX blocks are also
      noncontiguous. But during RVU driver initialization, the driver
      is assuming they are contiguous and trying to access
      cgx or lmac with their id which is resulting in kernel panic.
      
      This patch fixes the issue by adding proper checks.
      
      [   23.219150] pc : cgx_lmac_read+0x38/0x70
      [   23.219154] lr : rvu_program_channels+0x3f0/0x498
      [   23.223852] sp : ffff000100d6fc80
      [   23.227158] x29: ffff000100d6fc80 x28: ffff00010009f880 x27:
      000000000000005a
      [   23.234288] x26: ffff000102586768 x25: 0000000000002500 x24:
      fffffffffff0f000
      
      Fixes: 91c6945e ("octeontx2-af: cn10k: Add RPM MAC support")
      Signed-off-by: default avatarHariprasad Kelam <hkelam@marvell.com>
      Signed-off-by: default avatarSunil Goutham <sgoutham@marvell.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      79ebb537