1. 12 Apr, 2024 9 commits
    • tcp: add support for SO_PEEK_OFF socket option · 05ea4916
      Jon Maloy authored
      When reading received messages from a socket with MSG_PEEK, we may want
      to read the contents with an offset, like we can do with pread/preadv()
      when reading files. Currently, it is not possible to do that.
      
      In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
      in a similar way to how it is done for Unix Domain sockets.
      
      In the iperf3 log examples shown below, we can observe a throughput
      improvement of 15-20 % in the direction host->namespace when using the
      protocol splicer 'pasta' (https://passt.top).
      This is a consistent result.
      
      pasta(1) and passt(1) implement user-mode networking for network
      namespaces (containers) and virtual machines by means of a translation
      layer between Layer-2 network interface and native Layer-4 sockets
      (TCP, UDP, ICMP/ICMPv6 echo).
      
      Received, pending TCP data to the container/guest is kept in kernel
      buffers until acknowledged, so the tool routinely needs to fetch new
      data from the socket, skipping data that was already sent.
      
      At the moment this is implemented using a dummy buffer passed to
      recvmsg(). With this change, we don't need a dummy buffer and the
      related buffer copy (copy_to_user()) anymore.
      
      passt and pasta are supported in KubeVirt and libvirt/qemu.
      
      jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
      SO_PEEK_OFF not supported by kernel.
      
      jmaloy@freyr:~/passt# iperf3 -s
      -----------------------------------------------------------
      Server listening on 5201 (test #1)
      -----------------------------------------------------------
      Accepted connection from 192.168.122.1, port 44822
      [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
      [ ID] Interval           Transfer     Bitrate
      [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
      [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
      [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
      [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
      [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
      [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
      [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
      [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
      [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
      [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
      [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bitrate
      [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
      -----------------------------------------------------------
      Server listening on 5201 (test #2)
      -----------------------------------------------------------
      ^Ciperf3: interrupt - the server has terminated
      jmaloy@freyr:~/passt#
      logout
      [ perf record: Woken up 23 times to write data ]
      [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
      jmaloy@freyr:~/passt$
      
      jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
      SO_PEEK_OFF supported by kernel.
      
      jmaloy@freyr:~/passt# iperf3 -s
      -----------------------------------------------------------
      Server listening on 5201 (test #1)
      -----------------------------------------------------------
      Accepted connection from 192.168.122.1, port 52084
      [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
      [ ID] Interval           Transfer     Bitrate
      [  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
      [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
      [  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
      [  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
      [  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
      [  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
      [  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
      [  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
      [  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
      [  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
      [  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bitrate
      [  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
      -----------------------------------------------------------
      Server listening on 5201 (test #2)
      -----------------------------------------------------------
      ^Ciperf3: interrupt - the server has terminated
      logout
      [ perf record: Woken up 20 times to write data ]
      [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
      jmaloy@freyr:~/passt$
      
      The perf record confirms this result. Below, we can observe that the
      CPU spends significantly less time in the function ____sys_recvmsg()
      when we have offset support.
      
      Without offset support:
      ----------------------
      jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                             -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
      46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
      
      With offset support:
      ----------------------
      jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
                             -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
      28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Jon Maloy <jmaloy@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240409152805.913891-1-jmaloy@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: usb: qmi_wwan: Remove generic .ndo_get_stats64 · 3cddfeca
      Breno Leitao authored
      Commit 3e2f544d ("net: get stats64 if device if driver is
      configured") moved the callback to dev_get_tstats64() to net core, so,
      unless the driver is doing some custom stats collection, it does not
      need to set .ndo_get_stats64.
      
      Since this driver now relies on NETDEV_PCPU_STAT_TSTATS, it
      doesn't need to set the generic dev_get_tstats64() as its
      .ndo_get_stats64 function pointer.

      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240409133307.2058099-2-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: usb: qmi_wwan: Leverage core stats allocator · 8959bf2a
      Breno Leitao authored
      With commit 34d21de9 ("net: Move {l,t,d}stats allocation to core and
      convert veth & vrf"), stats allocation can be done in the net core
      instead of in this driver.
      
      With this new approach, the driver doesn't have to bother with error
      handling (allocation failure checking, making sure free happens in the
      right spot, etc.). This is the core's responsibility now.
      
      Remove the allocation in the qmi_wwan driver and leverage the network
      core allocation instead.

      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240409133307.2058099-1-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mpls: no longer hold RTNL in mpls_netconf_dump_devconf() · e0f89d28
      Eric Dumazet authored
      - Use for_each_netdev_dump() to no longer rely
        on net->dev_index_head hash table.
      
      - No longer care about net->dev_base_seq
      
      - Fix return value at the end of a dump,
        so that NLMSG_DONE can be appended to current skb,
        saving one recvmsg() system call.
      
      - No longer grab RTNL, RCU protection is enough,
        after adding one READ_ONCE(mdev->input_enabled)
        in mpls_netconf_fill_devconf()

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240410111951.2673193-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • flow_offload: fix flow_offload_has_one_action() kdoc · e1eb10f8
      Asbjørn Sloth Tønnesen authored
      include/net/flow_offload.h:351: warning:
        No description found for return value of 'flow_offload_has_one_action'

      Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
      Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
      Link: https://lore.kernel.org/r/20240410114718.15145-1-ast@fiberby.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Expose the VF/SF RX drop counter on the representor · 919b38a9
      Carolina Jubran authored
      Q counters are device-level counters that track specific
      events, among which are out_of_buffer events. These events
      occur when packets are dropped due to a lack of receive
      buffer in the RX queue.
      
      Expose the total number of out_of_buffer events on the
      VFs/SFs to their respective representor, using the
      "ip stats group link" under the name of "rx_missed".
      
      The "rx_missed" equals the sum of all
      Q counters out_of_buffer values allocated on the VFs/SFs.

      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Reviewed-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240410214154.250583-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'minor-cleanups-to-skb-frag-ref-unref' · ef4ba011
      Jakub Kicinski authored
      Mina Almasry says:
      
      ====================
      Minor cleanups to skb frag ref/unref
      
      This series is largely motivated by a recent discussion where there was
      some confusion on how to properly ref/unref pp pages vs non pp pages:
      
      https://lore.kernel.org/netdev/CAHS8izOoO-EovwMwAm9tLYetwikNPxC0FKyVGu1TPJWSz4bGoA@mail.gmail.com/T/#t
      
      There is some subtlety there because pp uses page->pp_ref_count for
      refcounting, while non-pp uses get_page()/put_page() for ref counting.
      Getting the refcounting pairs wrong can lead to kernel crash.
      
      Additionally, it may currently not be obvious to skb users unaware of
      page pool internals how to properly acquire a ref on a pp frag. It
      requires checking skb->pp_recycle & is_pp_page() to make the correct
      calls and may require some handling at the call site that is aware of
      what are arguably pp internals.
      
      This series is a minor refactor with a couple of goals:
      
      1. skb users should be able to ref/unref a frag using
         [__]skb_frag_[un]ref() functions without needing to understand pp
         concepts and pp_ref_count vs get/put_page() differences.
      
      2. reference counting functions should have a mirror opposite. I.e. there
         should be a foo_unref() to every foo_ref() with a mirror opposite
         implementation (as much as possible).
      
      This is RFC to collect feedback if this change is desirable, but also so
      that I don't race with the fix for the issue Dragos is seeing for his
      crash.
      
      https://lore.kernel.org/lkml/CAHS8izN436pn3SndrzsCyhmqvJHLyxgCeDpWXA4r1ANt3RCDLQ@mail.gmail.com/T/
      ====================
      
      Link: https://lore.kernel.org/r/20240410190505.1225848-1-almasrymina@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mirror skb frag ref/unref helpers · a580ea99
      Mina Almasry authored
      Refactor some of the skb frag ref/unref helpers for improved clarity.
      
      Implement napi_pp_get_page() to be the mirror counterpart of
      napi_pp_put_page().
      
      Implement skb_page_ref() to be the mirror of skb_page_unref().
      
      Improve __skb_frag_ref() to become a mirror counterpart of
      __skb_frag_unref(). Previously unref could handle pp & non-pp pages,
      while the ref could only handle non-pp pages. Now both the ref & unref
      helpers can correctly handle both pp & non-pp pages.
      
      Now that __skb_frag_ref() can handle both pp & non-pp pages, remove
      skb_pp_frag_ref(), and use __skb_frag_ref() instead.  This lets us
      remove pp specific handling from skb_try_coalesce.
      
      Additionally, since __skb_frag_ref() can now handle both pp & non-pp
      pages, a latent issue in skb_shift() should now be fixed. Previously
      this function would do a non-pp ref & pp unref on potential pp frags
      (fragfrom). After this patch, skb_shift() should correctly do a pp
      ref/unref on pp frags.
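
      The mirroring idea can be illustrated with a toy userspace model. This
      is only a sketch: struct toy_page and the toy_*() functions are
      hypothetical stand-ins, not the kernel's struct page or the real
      __skb_frag_ref()/__skb_frag_unref(), and the real helpers also consult
      skb->pp_recycle.

```c
#include <stdbool.h>

/* Toy stand-in for a page: NOT the kernel definition. */
struct toy_page {
	bool is_pp;        /* page belongs to a page pool */
	long pp_ref_count; /* models page->pp_ref_count */
	long refcount;     /* models the get_page()/put_page() count */
};

/* Take one reference, choosing the counter by page type. */
void toy_frag_ref(struct toy_page *p)
{
	if (p->is_pp)
		p->pp_ref_count++;
	else
		p->refcount++;
}

/* Exact mirror of toy_frag_ref(): drop from the same counter. */
void toy_frag_unref(struct toy_page *p)
{
	if (p->is_pp)
		p->pp_ref_count--;
	else
		p->refcount--;
}

/* Models the old asymmetry: a ref that only knows the non-pp counter. */
void toy_legacy_ref(struct toy_page *p)
{
	p->refcount++;
}
```

      In this model, toy_frag_ref() followed by toy_frag_unref() leaves both
      counters unchanged for either page type, whereas toy_legacy_ref()
      followed by toy_frag_unref() on a pp page bumps one counter and drops
      the other, which is the kind of mismatch the message describes for
      skb_shift().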

      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: move skb ref helpers to new header · f6d827b1
      Mina Almasry authored
      Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref()
      helpers. Many of the consumers of skbuff.h do not actually use any of
      the skb ref helpers, and we can speed up compilation a bit by minimizing
      this header file.
      
      Additionally, in a later patch in the series we add page_pool support
      to skb_frag_ref(), which requires some page_pool dependencies. We can
      now add these dependencies to skbuff_ref.h instead of the ubiquitous
      skbuff.h.

      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 11 Apr, 2024 31 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 94426ed2
      Jakub Kicinski authored
      Cross-merge networking fixes after downstream PR.
      
      Conflicts:
      
      net/unix/garbage.c
        47d8ac01 ("af_unix: Fix garbage collector racing against connect()")
        4090fa37 ("af_unix: Replace garbage collection algorithm.")
      
      Adjacent changes:
      
      drivers/net/ethernet/broadcom/bnxt/bnxt.c
        faa12ca2 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
        b3d0083c ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")
      
      drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
        7ac10c7d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
        194fad5b ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")
      
      drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
        958f56e4 ("net/mlx5e: Un-expose functions in en.h")
        49e6c938 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")

      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 2ae9a897
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from bluetooth.
      
        Current release - new code bugs:
      
         - netfilter: complete validation of user input
      
         - mlx5: disallow SRIOV switchdev mode when in multi-PF netdev
      
        Previous releases - regressions:
      
         - core: fix u64_stats_init() for lockdep when used repeatedly in one
           file
      
         - ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr
      
         - bluetooth: fix memory leak in hci_req_sync_complete()
      
         - batman-adv: avoid infinite loop trying to resize local TT
      
         - drv: geneve: fix header validation in geneve[6]_xmit_skb
      
         - drv: bnxt_en: fix possible memory leak in
           bnxt_rdma_aux_device_init()
      
         - drv: mlx5: offset comp irq index in name by one
      
         - drv: ena: avoid double-free clearing stale tx_info->xdpf value
      
         - drv: pds_core: fix pdsc_check_pci_health deadlock
      
        Previous releases - always broken:
      
         - xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING
      
         - bluetooth: fix setsockopt not validating user input
      
         - af_unix: clear stale u->oob_skb.
      
         - nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies
      
         - drv: virtio_net: fix guest hangup on invalid RSS update
      
         - drv: mlx5e: Fix mlx5e_priv_init() cleanup flow
      
         - dsa: mt7530: trap link-local frames regardless of ST Port State"
      
      * tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
        net: ena: Set tx_info->xdpf value to NULL
        net: ena: Fix incorrect descriptor free behavior
        net: ena: Wrong missing IO completions check order
        net: ena: Fix potential sign extension issue
        af_unix: Fix garbage collector racing against connect()
        net: dsa: mt7530: trap link-local frames regardless of ST Port State
        Revert "s390/ism: fix receive message buffer allocation"
        net: sparx5: fix wrong config being used when reconfiguring PCS
        net/mlx5: fix possible stack overflows
        net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev
        net/mlx5e: RSS, Block XOR hash with over 128 channels
        net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
        net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
        net/mlx5e: Fix mlx5e_priv_init() cleanup flow
        net/mlx5e: RSS, Block changing channels number when RXFH is configured
        net/mlx5: Correctly compare pkt reformat ids
        net/mlx5: Properly link new fs rules into the tree
        net/mlx5: offset comp irq index in name by one
        net/mlx5: Register devlink first under devlink lock
        net/mlx5: E-switch, store eswitch pointer before registering devlink_param
        ...
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · ab4319fd
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "The most important fix is the sg one because the regression it fixes
        (spurious warning and use after final put) is already backported to
        stable.
      
        The next biggest impact is the target fix for wrong credentials used
        to load a module because it's affecting new kernels installed on
        selinux based distributions.
      
        The other three fixes are an obvious off by one and SATA protocol
        issues"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: qla2xxx: Fix off by one in qla_edif_app_getstats()
        scsi: hisi_sas: Modify the deadline for ata_wait_after_reset()
        scsi: hisi_sas: Handle the NCQ error returned by D2H frame
        scsi: target: Fix SELinux error when systemd-modules loads the target module
        scsi: sg: Avoid race in error handling & drop bogus warn
    • Merge tag 'loongarch-fixes-6.9-1' of... · 5de6b467
      Linus Torvalds authored
      Merge tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
      
      Pull LoongArch fixes from Huacai Chen:
      
       - make {virt, phys, page, pfn} translation work with KFENCE for
         LoongArch (otherwise NVMe and virtio-blk cannot work with KFENCE
         enabled)
      
       - update dts files for Loongson-2K series to make devices work
         correctly
      
       - fix a build error
      
      * tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        LoongArch: Include linux/sizes.h in addrspace.h to prevent build errors
        LoongArch: Update dts for Loongson-2K2000 to support GMAC/GNET
        LoongArch: Update dts for Loongson-2K2000 to support PCI-MSI
        LoongArch: Update dts for Loongson-2K2000 to support ISA/LPC
        LoongArch: Update dts for Loongson-2K1000 to support ISA/LPC
        LoongArch: Make virt_addr_valid()/__virt_addr_valid() work with KFENCE
        LoongArch: Make {virt, phys, page, pfn} translation work with KFENCE
        mm: Move lowmem_page_address() a little later
    • Merge tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs · e1dc191d
      Linus Torvalds authored
      Pull more bcachefs fixes from Kent Overstreet:
       "Notable user impacting bugs
      
         - On multi device filesystems, recovery was looping in
           btree_trans_too_many_iters(). This checks if a transaction has
           touched too many btree paths (because of iteration over many keys),
           and issues a restart to drop unneeded paths.
      
           But it's now possible for some paths to exceed the previous limit
           without iteration in the interior btree update path, since the
           transaction commit will do alloc updates for every old and new
           btree node, and during journal replay we don't use the btree write
           buffer for locking reasons and thus those updates use btree paths
           when they wouldn't normally.
      
         - Fix a corner case in rebalance when moving extents on a
           durability=0 device. This wouldn't be hit when a device was
           formatted with durability=0 since in that case we'll only use it as
           a write through cache (only cached extents will live on it), but
           durability can now be changed on an existing device.
      
         - bch2_get_acl() could rarely forget to handle a transaction restart;
           this manifested as the occasional missing acl that came back after
           dropping caches.
      
         - Fix a major performance regression on high iops multithreaded write
           workloads (only since 6.9-rc1); a previous fix for a deadlock in
           the interior btree update path to check the journal watermark
           introduced a dependency on the state of btree write buffer flushing
           that we didn't want.
      
         - Assorted other repair paths and recovery fixes"
      
      * tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
        bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
        bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
        bcachefs: Fix a race in btree_update_nodes_written()
        bcachefs: btree_node_scan: Respect member.data_allowed
        bcachefs: Don't scan for btree nodes when we can reconstruct
        bcachefs: Fix check_topology() when using node scan
        bcachefs: fix eytzinger0_find_gt()
        bcachefs: fix bch2_get_acl() transaction restart handling
        bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
        bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
        bcachefs: Rename struct field swap to prevent macro naming collision
        MAINTAINERS: Add entry for bcachefs documentation
        Documentation: filesystems: Add bcachefs toctree
        bcachefs: JOURNAL_SPACE_LOW
        bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
        bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
        bcachefs: fix rand_delete unit test
        bcachefs: fix ! vs ~ typo in __clear_bit_le64()
        bcachefs: Fix rebalance from durability=0 device
        bcachefs: Print shutdown journal sequence number
        ...
    • Merge tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of... · 346668f0
      Linus Torvalds authored
      Merge tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
      
      Pull chrome platform fix from Tzung-Bi Shih:
       "Fix a NULL pointer dereference"
      
      * tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
        platform/chrome: cros_ec_uart: properly fix race condition
    • Merge branch 'mptcp-add-last-time-fields-in-mptcp_info' · a55b39e8
      Jakub Kicinski authored
      Matthieu Baerts says:
      
      ====================
      mptcp: add last time fields in mptcp_info
      
      These patches from Geliang add support for the "last time" field in
      MPTCP Info, and verify that the counters look valid.
      
      Patch 1 adds these counters: last_data_sent, last_data_recv and
      last_ack_recv. They are available in the MPTCP Info, so exposed via
      getsockopt(MPTCP_INFO) and the Netlink Diag interface.
      
      Patch 2 adds a test in diag.sh MPTCP selftest, to check that the
      counters have moved by at least 250ms, after having waited twice that
      time.
      
      v1: https://lore.kernel.org/r/20240405-upstream-net-next-20240405-mptcp-last-time-info-v1-0-52dc49453649@kernel.org
      ====================
      
      Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-0-f95bd6b33e51@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: mptcp: test last time mptcp_info · 22724c89
      Geliang Tang authored
      This patch adds a new helper chk_msk_info() to show the counters in
      mptcp_info of the given info, and check that the timestamps move
      forward. Use it to show newly added last_data_sent, last_data_recv
      and last_ack_recv in mptcp_info in chk_last_time_info().

      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-2-f95bd6b33e51@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: add last time fields in mptcp_info · 18d82cde
      Geliang Tang authored
      This patch adds "last time" fields last_data_sent, last_data_recv and
      last_ack_recv in struct mptcp_sock to record the last time data_sent,
      data_recv and ack_recv happened. They all are initialized as
      tcp_jiffies32 in __mptcp_init_sock(), and updated as tcp_jiffies32 too
      when data is sent in __subflow_push_pending(), data is received in
      __mptcp_move_skbs_from_subflow(), and ack is received in ack_update_msk().
      
      Similar to tcpi_last_data_sent, tcpi_last_data_recv and tcpi_last_ack_recv
      exposed with TCP, this patch exposes the last time "an action happened" for
      MPTCP in mptcp_info, named mptcpi_last_data_sent, mptcpi_last_data_recv and
      mptcpi_last_ack_recv, calculated in mptcp_diag_fill_info() as the time
      deltas between now and the newly added last time fields in mptcp_sock.
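
      The "time delta between now and a stored stamp" computation can be
      sketched in plain C as follows. This is an illustration under stated
      assumptions: ticks_delta_ms() is a hypothetical userspace helper that
      takes the tick rate as a parameter, while the kernel has its own
      conversion helpers and tick rate.

```c
#include <stdint.h>

/* Milliseconds elapsed between two 32-bit tick stamps taken at hz ticks
 * per second (hz must be non-zero). Unsigned subtraction wraps modulo
 * 2^32, so the result stays correct when the tick counter has rolled
 * over once between last and now. (Hypothetical helper.) */
uint64_t ticks_delta_ms(uint32_t now, uint32_t last, uint32_t hz)
{
	uint32_t delta = now - last;

	return ((uint64_t)delta * 1000u) / hz;
}
```

      Storing only the 32-bit stamp at event time and computing the delta
      when the info is read keeps the hot path to a single store.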
      
      Since msk->last_ack_recv needs to be protected by mptcp_data_lock/unlock,
      and lock_sock_fast can sleep and be quite slow, move the entire
      mptcp_data_lock/unlock block after the lock/unlock_sock_fast block.
      Then mptcpi_last_data_sent and mptcpi_last_data_recv are set in
      lock/unlock_sock_fast block, while mptcpi_last_ack_recv is set in
      mptcp_data_lock/unlock block, which is protected by a spinlock and
      should not block for too long.
      
      Also add three reserved bytes in struct mptcp_info not to have holes in
      this structure exposed to userspace.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/446
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-1-f95bd6b33e51@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch mana-ib-flex of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git · 0e36c21d
      Jakub Kicinski authored
      Erick Archer says:
      
      ====================
      mana: Add flex array to struct mana_cfg_rx_steer_req_v2 (part)
      
      The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
      trailing elements. Specifically, it uses a "mana_handle_t" array. So,
      use the preferred way in the kernel of declaring a flexible array [1].
      
      At the same time, prepare for the coming implementation by GCC and Clang
      of the __counted_by attribute. Flexible array members annotated with
      __counted_by can have their accesses bounds-checked at run-time via
      CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
      strcpy/memcpy-family functions).
      
      Also, avoid the open-coded arithmetic in the memory allocator functions
      [2] using the "struct_size" macro.
      
      Moreover, use the "offsetof" helper to get the indirect table offset
      instead of the "sizeof" operator and avoid the open-coded arithmetic in
      pointers using the new flex member. This new structure member also allows
      us to remove the "req_indir_tab" variable since it is no longer needed.
      
      Now, it is also possible to use the "flex_array_size" helper to compute
      the size of these trailing elements in the "memcpy" function.
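
      One general reason to prefer offsetof() over sizeof() when locating a
      trailing array is that the two can differ when the struct has tail
      padding. The sketch below shows this with a toy layout; struct toy_req
      and its helpers are hypothetical and deliberately unlike the real
      struct mana_cfg_rx_steer_req_v2, whose layout may have no such padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy request layout: NOT the real mana structure. */
struct toy_req {
	uint64_t hdr;
	uint8_t  num_entries;
	uint8_t  tab[];        /* flexible array member */
};

/* Byte offset of the trailing table from the start of the request.
 * With this layout it is smaller than sizeof(struct toy_req), because
 * sizeof() includes padding up to the alignment of hdr. */
size_t toy_tab_offset(void)
{
	return offsetof(struct toy_req, tab);
}

/* Allocate a request and copy n table entries into the flex member. */
struct toy_req *toy_req_new(const uint8_t *src, uint8_t n)
{
	/* n <= 255, so this sum cannot overflow size_t */
	struct toy_req *req = calloc(1, sizeof(*req) + n);

	if (!req)
		return NULL;
	req->num_entries = n;
	/* Size of the trailing elements: count * element size */
	memcpy(req->tab, src, n * sizeof(req->tab[0]));
	return req;
}
```

      Here the table starts at byte 9, so reading it at sizeof(struct toy_req)
      would look past the first entries; addressing it through the member
      itself avoids that class of mistake.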
      
      Specifically, the first commit adds the flex member and the patches 2 and
      3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2".
      
      This code was detected with the help of Coccinelle, and audited and
      modified manually. The Coccinelle script used to detect this code pattern
      is the following:
      
      virtual report
      
      @rule1@
      type t1;
      type t2;
      identifier i0;
      identifier i1;
      identifier i2;
      identifier ALLOC =~ "kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc";
      position p1;
      @@
      
      i0 = sizeof(t1) + sizeof(t2) * i1;
      ...
      i2 = ALLOC@p1(..., i0, ...);
      
      @script:python depends on report@
      p1 << rule1.p1;
      @@
      
      msg = "WARNING: verify allocation on line %s" % (p1[0].line)
      coccilib.report.print_report(p1[0],msg)
      
      Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1]
      Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2]
      
      v1: https://lore.kernel.org/linux-hardening/AS8PR02MB7237974EF1B9BAFA618166C38B382@AS8PR02MB7237.eurprd02.prod.outlook.com/
      v2: https://lore.kernel.org/linux-hardening/AS8PR02MB723729C5A63F24C312FC9CD18B3F2@AS8PR02MB7237.eurprd02.prod.outlook.com/
      ====================
      
      Link: https://lore.kernel.org/r/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0e36c21d
    • Erick Archer's avatar
      net: mana: Avoid open coded arithmetic · a68292eb
      Erick Archer authored
      This is an effort to get rid of all multiplications from allocation
      functions in order to prevent integer overflows [1][2].
      
      As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2"
      and this structure ends in a flexible array:
      
      struct mana_cfg_rx_steer_req_v2 {
              [...]
              mana_handle_t indir_tab[] __counted_by(num_indir_entries);
      };
      
      the preferred way in the kernel is to use the struct_size() helper to
      do the arithmetic instead of the calculation "size + size * count" in
      the kzalloc() function.
      
      Moreover, use the "offsetof" helper to get the indirect table offset
      instead of the "sizeof" operator and avoid the open-coded arithmetic in
      pointers using the new flex member. This new structure member also allows
      us to remove the "req_indir_tab" variable since it is no longer needed.
      
      Now, it is also possible to use the "flex_array_size" helper to compute
      the size of these trailing elements in the "memcpy" function.
      
      This way, the code is more readable and safer.
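      
      A minimal userspace sketch of the idea behind struct_size(), assuming the
      GCC/Clang overflow builtins are available. The names "steer_req",
      "checked_struct_size" and "steer_req_size" are hypothetical and are not
      taken from the driver; the sketch saturates at SIZE_MAX on overflow so
      that an allocator would fail instead of returning a short buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a request that ends in a flexible array. */
struct steer_req {
	uint16_t num_indir_entries;
	uint64_t indir_tab[];
};

/*
 * Compute base + elem * count, saturating at SIZE_MAX on overflow.
 * Relies on the GCC/Clang overflow builtins.
 */
static size_t checked_struct_size(size_t base, size_t elem, size_t count)
{
	size_t array_bytes, total;

	if (__builtin_mul_overflow(elem, count, &array_bytes))
		return SIZE_MAX;
	if (__builtin_add_overflow(base, array_bytes, &total))
		return SIZE_MAX;
	return total;
}

/* Bytes needed for a steer_req that carries n table entries. */
static size_t steer_req_size(size_t n)
{
	return checked_struct_size(sizeof(struct steer_req),
				   sizeof(uint64_t), n);
}
```

      Compared with the open-coded "size + size * count", an absurd count can
      no longer wrap around to a small allocation size.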
      
      This code was detected with the help of Coccinelle, and audited and
      modified manually.
      
      Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
      Link: https://github.com/KSPP/linux/issues/160 [2]
      Signed-off-by: Erick Archer <erick.archer@outlook.com>
      Link: https://lore.kernel.org/r/AS8PR02MB7237A21355C86EC0DCC0D83B8B022@AS8PR02MB7237.eurprd02.prod.outlook.com
      Reviewed-by: Justin Stitt <justinstitt@google.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      a68292eb
    • Erick Archer's avatar
      RDMA/mana_ib: Prefer struct_size over open coded arithmetic · 29b8e13a
      Erick Archer authored
      This is an effort to get rid of all multiplications from allocation
      functions in order to prevent integer overflows [1][2].
      
      As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2"
      and this structure ends in a flexible array:
      
      struct mana_cfg_rx_steer_req_v2 {
              [...]
              mana_handle_t indir_tab[] __counted_by(num_indir_entries);
      };
      
      the preferred way in the kernel is to use the struct_size() helper to
      do the arithmetic instead of the calculation "size + size * count" in
      the kzalloc() function.
      
      Moreover, use the "offsetof" helper to get the indirect table offset
      instead of the "sizeof" operator and avoid the open-coded arithmetic in
      pointers using the new flex member. This new structure member also allows
      us to remove the "req_indir_tab" variable since it is no longer needed.
      
      This way, the code is more readable and safer.
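      
      A small sketch of why naming the flex member is more robust than
      "sizeof" plus pointer arithmetic. The struct "tab_req" below is
      hypothetical and deliberately laid out with trailing padding, so that
      sizeof() and the real start of the array differ; the mana structure
      itself may well have no such padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the flexible array does not start at sizeof(). */
struct tab_req {
	uint64_t cookie;
	uint8_t  num_entries;
	uint8_t  tab[];		/* starts right after num_entries */
};

/* Offset of the table from the start of the request. */
static size_t tab_offset(void)
{
	return offsetof(struct tab_req, tab);
}

/* Address of entry i, derived from the member rather than (req + 1). */
static uint8_t *tab_entry(struct tab_req *req, size_t i)
{
	return &req->tab[i];
}
```

      Here tab_offset() is 9 while sizeof(struct tab_req) is larger because
      of alignment padding, so "(void *)(req + 1)" would point past the first
      entries.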
      
      This code was detected with the help of Coccinelle, and audited and
      modified manually.
      
      Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
      Link: https://github.com/KSPP/linux/issues/160 [2]
      Signed-off-by: Erick Archer <erick.archer@outlook.com>
      Link: https://lore.kernel.org/r/AS8PR02MB72375EB06EE1A84A67BE722E8B022@AS8PR02MB7237.eurprd02.prod.outlook.com
      Reviewed-by: Long Li <longli@microsoft.com>
      Reviewed-by: Justin Stitt <justinstitt@google.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      29b8e13a
    • Erick Archer's avatar
      net: mana: Add flex array to struct mana_cfg_rx_steer_req_v2 · bfec4e18
      Erick Archer authored
      The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
      trailing elements. Specifically, it uses a "mana_handle_t" array. So,
      use the preferred way in the kernel of declaring a flexible array [1].
      
      At the same time, prepare for the coming implementation by GCC and Clang
      of the __counted_by attribute. Flexible array members annotated with
      __counted_by can have their accesses bounds-checked at run-time via
      CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
      strcpy/memcpy-family functions).
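      
      A userspace sketch of a flexible array member annotated with a
      counted_by-style attribute. The macro "COUNTED_BY", the struct
      "handle_table" and the helper "handle_table_alloc" are hypothetical
      names; the macro expands to nothing on compilers that do not know the
      attribute, so the sketch builds either way.

```c
#include <stdint.h>
#include <stdlib.h>

/* Expand to the counted_by attribute only where the compiler knows it. */
#ifdef __has_attribute
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

struct handle_table {
	uint16_t num_entries;
	uint64_t entries[] COUNTED_BY(num_entries);	/* flexible array */
};

/* Allocate a zeroed table with room for n entries. */
static struct handle_table *handle_table_alloc(uint16_t n)
{
	struct handle_table *t;

	/* n is only 16 bits wide, so this sum cannot wrap on common ABIs. */
	t = calloc(1, sizeof(*t) + (size_t)n * sizeof(t->entries[0]));
	if (t)
		t->num_entries = n;	/* set the counter before using entries */
	return t;
}
```

      The counter is assigned before any element is touched, which is the
      order a bounds checker that reads the counter would expect.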
      
      This is a preparatory step for refactoring the two consumers of this
      structure:
      
       drivers/infiniband/hw/mana/qp.c
       drivers/net/ethernet/microsoft/mana/mana_en.c
      
      The ultimate goal is to avoid the open-coded arithmetic in the memory
      allocator functions [2] using the "struct_size" macro.
      
      Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1]
      Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2]
      Signed-off-by: Erick Archer <erick.archer@outlook.com>
      Link: https://lore.kernel.org/r/AS8PR02MB7237E2900247571C9CB84C678B022@AS8PR02MB7237.eurprd02.prod.outlook.com
      Reviewed-by: Long Li <longli@microsoft.com>
      Reviewed-by: Justin Stitt <justinstitt@google.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      bfec4e18
    • Paolo Abeni's avatar
      Merge branch 'ena-driver-bug-fixes' · 4e1ad31c
      Paolo Abeni authored
      David Arinzon says:
      
      ====================
      ENA driver bug fixes
      
      From: David Arinzon <darinzon@amazon.com>
      
      This patchset contains multiple bug fixes for the
      ENA driver.
      ====================
      
      Link: https://lore.kernel.org/r/20240410091358.16289-1-darinzon@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      4e1ad31c
    • David Arinzon's avatar
      net: ena: Set tx_info->xdpf value to NULL · 36a1ca01
      David Arinzon authored
      The patch mentioned in the `Fixes` tag removed the explicit assignment
      of tx_info->xdpf to NULL with the justification that there's no need
      to set tx_info->xdpf to NULL and tx_info->num_of_bufs to 0 in case
      of a mapping error. Both values won't be used once the mapping function
      returns an error, and their values would be overridden by the next
      transmitted packet.
      
      While both values do indeed get overridden in the next transmission
      call, the value of tx_info->xdpf is also used to check whether a TX
      descriptor's transmission has been completed (i.e. a completion for it
      was polled).
      
      An example scenario:
      1. Mapping failed, tx_info->xdpf wasn't set to NULL
      2. A VF reset occurred leading to IO resource destruction and
         a call to ena_free_tx_bufs() function
      3. Although the descriptor whose mapping failed was freed by the
         transmission function, it still passes the check
           if (!tx_info->skb)
      
         (skb and xdp_frame are in a union)
      4. The xdp_frame associated with the descriptor is freed twice
      
      This patch restores the assignment of NULL to tx_info->xdpf so that the
      cleaning function knows that the descriptor has already been freed.
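      
      The scenario above can be sketched in userspace as follows. All names
      ("release_frame", "xmit_map_failed", "free_tx_bufs_sketch") are
      hypothetical and only mimic the shape of the driver code; frees are
      counted instead of performed so that a double free would show up as a
      count of 2.

```c
#include <stddef.h>

struct tx_info {
	union {			/* skb and xdpf share storage */
		void *skb;
		void *xdpf;
	};
};

static int frees;		/* counts simulated frees */

static void release_frame(struct tx_info *info)
{
	frees++;
	info->xdpf = NULL;	/* the restored assignment: mark as freed */
}

/* Transmit path: pretend the DMA mapping failed and drop the frame. */
static void xmit_map_failed(struct tx_info *info, void *frame)
{
	info->xdpf = frame;
	release_frame(info);
}

/* Teardown path: release every descriptor that still holds a buffer. */
static void free_tx_bufs_sketch(struct tx_info *ring, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!ring[i].skb)	/* also NULL once xdpf was cleared */
			continue;
		release_frame(&ring[i]);
	}
}
```

      Without the NULL assignment in release_frame(), the teardown loop would
      see a stale pointer through the union and release the frame again.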
      
      Fixes: 504fd6a5 ("net: ena: fix DMA mapping function issues in XDP")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      36a1ca01
    • David Arinzon's avatar
      net: ena: Fix incorrect descriptor free behavior · bf02d9fe
      David Arinzon authored
      ENA has two types of TX queues:
      - queues which only process TX packets arriving from the network stack
      - queues which only process TX packets forwarded to it by XDP_REDIRECT
        or XDP_TX instructions
      
      The ena_free_tx_bufs() cycles through all descriptors in a TX queue
      and unmaps + frees every descriptor that hasn't been acknowledged yet
      by the device (uncompleted TX transactions).
      The function assumes that the processed TX queue is necessarily from
      the first category listed above and ends up using napi_consume_skb()
      for descriptors belonging to an XDP specific queue.
      
      This patch solves a bug in which, in case of a VF reset, the
      descriptors aren't freed correctly, leading to crashes.
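      
      A rough sketch of the idea, assuming the teardown routine can tell the
      two queue types apart. The struct "tx_ring_sketch", its "is_xdp" flag
      and the counters are hypothetical; the counters stand in for the
      skb-specific and XDP-specific release routines.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4

struct tx_ring_sketch {
	bool  is_xdp;		/* true for XDP_TX/XDP_REDIRECT-only queues */
	void *bufs[RING_SLOTS];	/* NULL means the slot already completed */
	int   skb_frees;	/* buffers released through the skb path */
	int   xdp_frees;	/* buffers released through the XDP path */
};

/* Release all uncompleted buffers with the routine matching the queue. */
static void ring_free_pending(struct tx_ring_sketch *ring)
{
	for (size_t i = 0; i < RING_SLOTS; i++) {
		if (!ring->bufs[i])
			continue;
		if (ring->is_xdp)
			ring->xdp_frees++;	/* XDP frame release goes here */
		else
			ring->skb_frees++;	/* skb release goes here */
		ring->bufs[i] = NULL;
	}
}
```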
      
      Fixes: 548c4940 ("net: ena: Implement XDP_TX action")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      bf02d9fe
    • David Arinzon's avatar
      net: ena: Wrong missing IO completions check order · f7e41718
      David Arinzon authored
      Missing IO completions check is called every second (HZ jiffies).
      This commit fixes several issues with this check:
      
      1. Duplicate queues check:
         Max of 4 queues are scanned on each check due to monitor budget.
         Once reaching the budget, this check exits under the assumption that
         the next check will continue to scan the remainder of the queues,
         but in practice, next check will first scan the last already scanned
         queue which is not necessary and may cause the full queue scan to
         last a couple of seconds longer.
         The fix is to start every check with the next queue to scan.
         For example, on 8 IO queues:
         Bug: [0,1,2,3], [3,4,5,6], [6,7]
         Fix: [0,1,2,3], [4,5,6,7]
      
      2. Unbalanced queues check:
         In case the number of active IO queues is not a multiple of budget,
         there will be checks which don't utilize the full budget
         because the full scan exits when reaching the last queue id.
         The fix is to run every TX completion check with exact queue budget
         regardless of the queue id.
         For example, on 7 IO queues:
         Bug: [0,1,2,3], [4,5,6], [0,1,2,3]
         Fix: [0,1,2,3], [4,5,6,0], [1,2,3,4]
         The budget may be lowered in case the number of IO queues is less
         than the budget (4) to make sure there are no duplicate queues on
         the same check.
         For example, on 3 IO queues:
         Bug: [0,1,2,0], [1,2,0,1]
         Fix: [0,1,2], [0,1,2]
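      
      The fixed scan order can be sketched as a round-robin walk with a
      capped budget. "scan_round" is a hypothetical helper, not the driver
      function; "next" persists between checks and "out" receives the queue
      ids scanned in this check.

```c
#include <stddef.h>

/*
 * Scan up to `budget` queues starting at *next, wrapping around at
 * num_queues. The budget is lowered to num_queues so that no queue is
 * visited twice in one check. Returns the number of ids written to out.
 */
static size_t scan_round(unsigned int *next, unsigned int num_queues,
			 unsigned int budget, unsigned int *out)
{
	size_t n = 0;

	if (num_queues == 0)
		return 0;
	if (budget > num_queues)
		budget = num_queues;

	while (n < budget) {
		out[n++] = *next;
		*next = (*next + 1) % num_queues;
	}
	return n;
}
```

      With a budget of 4, seven queues give [0,1,2,3], [4,5,6,0], [1,2,3,4]
      and three queues give [0,1,2], [0,1,2], matching the "Fix" rows above.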
      
      Fixes: 1738cd3e ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)")
      Signed-off-by: Amit Bernstein <amitbern@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      f7e41718
    • David Arinzon's avatar
      net: ena: Fix potential sign extension issue · 713a8519
      David Arinzon authored
      Small unsigned types are promoted to larger signed types in
      the case of multiplication, the result of which may overflow.
      In case the result of such a multiplication has its MSB
      turned on, it will be sign extended with '1's.
      This changes the multiplication result.
      
      Code example of the phenomenon:
      -------------------------------
      u16 x, y;
      size_t z1, z2;
      
      x = y = 0xffff;
      printk("x=%x y=%x\n",x,y);
      
      z1 = x*y;
      z2 = (size_t)x*y;
      
      printk("z1=%lx z2=%lx\n", z1, z2);
      
      Output:
      -------
      x=ffff y=ffff
      z1=fffffffffffe0001 z2=fffe0001
      
      The expected result of ffff*ffff is fffe0001, and without the
      explicit casting to avoid the unwanted sign extension we got
      fffffffffffe0001.
      
      This commit adds an explicit casting to avoid the sign extension
      issue.
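      
      A userspace sketch of the two halves of the problem, assuming a 64-bit
      size_t for the second helper. The helper names are hypothetical. The
      sketch does not perform the overflowing int multiplication itself,
      since that is undefined in ISO C; instead it shows the widened
      multiplication and, separately, what converting the negative int
      -131071 (bit pattern 0xfffe0001) to size_t produces.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen the first operand before multiplying, as the fix does. */
static size_t mul_widened(uint16_t x, uint16_t y)
{
	return (size_t)x * y;
}

/* Converting a negative int to size_t extends the sign bit with 1s. */
static size_t widen_int(int v)
{
	return (size_t)v;
}
```

      mul_widened(0xffff, 0xffff) yields 0xfffe0001, while widen_int(-131071)
      yields 0xfffffffffffe0001 on a 64-bit target, which is the unwanted
      value from the output above.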
      
      Fixes: 689b2bda ("net: ena: add functions for handling Low Latency Queues in ena_com")
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      713a8519
    • Paolo Abeni's avatar
      Merge tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · fe3eb406
      Paolo Abeni authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
        - L2CAP: Don't double set the HCI_CONN_MGMT_CONNECTED bit
        - Fix memory leak in hci_req_sync_complete
        - hci_sync: Fix using the same interval and window for Coded PHY
        - Fix not validating setsockopt user input
      
      * tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
        Bluetooth: l2cap: Don't double set the HCI_CONN_MGMT_CONNECTED bit
        Bluetooth: hci_sock: Fix not validating setsockopt user input
        Bluetooth: ISO: Fix not validating setsockopt user input
        Bluetooth: L2CAP: Fix not validating setsockopt user input
        Bluetooth: RFCOMM: Fix not validating setsockopt user input
        Bluetooth: SCO: Fix not validating setsockopt user input
        Bluetooth: Fix memory leak in hci_req_sync_complete()
        Bluetooth: hci_sync: Fix using the same interval and window for Coded PHY
        Bluetooth: ISO: Don't reject BT_ISO_QOS if parameters are unset
      ====================
      
      Link: https://lore.kernel.org/r/20240410191610.4156653-1-luiz.dentz@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      fe3eb406
    • Michal Luczaj's avatar
      af_unix: Fix garbage collector racing against connect() · 47d8ac01
      Michal Luczaj authored
      The garbage collector does not take into account the risk of an embryo
      getting enqueued during garbage collection. If such an embryo has a peer
      that carries SCM_RIGHTS, two consecutive passes of scan_children() may
      see a different set of children, leading to an incorrectly elevated
      inflight count and then a dangling pointer within the gc_inflight_list.
      
      sockets are AF_UNIX/SOCK_STREAM
      S is an unconnected socket
      L is a listening in-flight socket bound to addr, not in fdtable
      V's fd will be passed via sendmsg(), gets inflight count bumped
      
      connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
      ----------------	-------------------------	-----------
      
      NS = unix_create1()
      skb1 = sock_wmalloc(NS)
      L = unix_find_other(addr)
      unix_state_lock(L)
      unix_peer(S) = NS
      			// V count=1 inflight=0
      
       			NS = unix_peer(S)
       			skb2 = sock_alloc()
      			skb_queue_tail(NS, skb2[V])
      
      			// V became in-flight
      			// V count=2 inflight=1
      
      			close(V)
      
      			// V count=1 inflight=1
      			// GC candidate condition met
      
      						for u in gc_inflight_list:
      						  if (total_refs == inflight_refs)
      						    add u to gc_candidates
      
      						// gc_candidates={L, V}
      
      						for u in gc_candidates:
      						  scan_children(u, dec_inflight)
      
      						// embryo (skb1) was not
      						// reachable from L yet, so V's
      						// inflight remains unchanged
      __skb_queue_tail(L, skb1)
      unix_state_unlock(L)
      						for u in gc_candidates:
      						  if (u.inflight)
      						    scan_children(u, inc_inflight_move_tail)
      
      						// V count=1 inflight=2 (!)
      
      If there is a GC-candidate listening socket, lock/unlock its state. This
      makes GC wait until the end of any ongoing connect() to that socket. After
      flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
      there is another embryo coming, it can not possibly carry SCM_RIGHTS. At
      this point, unix_inflight() can not happen because unix_gc_lock is already
      taken. Inflight graph remains unaffected.
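      
      The lock/unlock "flip" can be sketched in userspace with a pthread
      mutex, assuming the enqueue on the connect() side happens entirely
      under that lock. The struct "listener" and both helpers are
      hypothetical; the counter is atomic only so that the sketch itself has
      no data race.

```c
#include <pthread.h>
#include <stdatomic.h>

struct listener {
	pthread_mutex_t state_lock;
	atomic_int queued_embryos;
};

static void listener_init(struct listener *l)
{
	pthread_mutex_init(&l->state_lock, NULL);
	atomic_init(&l->queued_embryos, 0);
}

/* connect() side: the embryo is enqueued while the lock is held. */
static void connect_enqueue(struct listener *l)
{
	pthread_mutex_lock(&l->state_lock);
	atomic_fetch_add(&l->queued_embryos, 1);
	pthread_mutex_unlock(&l->state_lock);
}

/*
 * GC side: take and drop the lock. Any connect() that was inside its
 * critical section has finished enqueueing once the lock is acquired.
 */
static int gc_sync_and_count(struct listener *l)
{
	pthread_mutex_lock(&l->state_lock);
	pthread_mutex_unlock(&l->state_lock);
	return atomic_load(&l->queued_embryos);
}
```

      The flip holds no state of its own; it only orders the collector after
      a connect() that was already in progress.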
      
      Fixes: 1fd05ba5 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240409201047.1032217-1-mhal@rbox.co
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      47d8ac01
    • Arınç ÜNAL's avatar
      net: dsa: mt7530: trap link-local frames regardless of ST Port State · 17c56011
      Arınç ÜNAL authored
      In Clause 5 of IEEE Std 802-2014, two sublayers of the data link layer
      (DLL) of the Open Systems Interconnection basic reference model (OSI/RM)
      are described; the medium access control (MAC) and logical link control
      (LLC) sublayers. The MAC sublayer is the one facing the physical layer.
      
      In 8.2 of IEEE Std 802.1Q-2022, the Bridge architecture is described. A
      Bridge component comprises a MAC Relay Entity for interconnecting the Ports
      of the Bridge, at least two Ports, and higher layer entities with at least
      a Spanning Tree Protocol Entity included.
      
      Each Bridge Port also functions as an end station and shall provide the MAC
      Service to an LLC Entity. Each instance of the MAC Service is provided to a
      distinct LLC Entity that supports protocol identification, multiplexing,
      and demultiplexing, for protocol data unit (PDU) transmission and reception
      by one or more higher layer entities.
      
      It is described in 8.13.9 of IEEE Std 802.1Q-2022 that in a Bridge, the LLC
      Entity associated with each Bridge Port is modeled as being directly
      connected to the attached Local Area Network (LAN).
      
      On the switch with CPU port architecture, CPU port functions as Management
      Port, and the Management Port functionality is provided by software which
      functions as an end station. Software is connected to an IEEE 802 LAN that
      is wholly contained within the system that incorporates the Bridge.
      Software provides access to the LLC Entity associated with each Bridge Port
      by the value of the source port field on the special tag on the frame
      received by software.
      
      We call frames that carry control information to determine the active
      topology and current extent of each Virtual Local Area Network (VLAN),
      i.e., spanning tree or Shortest Path Bridging (SPB) and Multiple VLAN
      Registration Protocol Data Units (MVRPDUs), and frames from other link
      constrained protocols, such as Extensible Authentication Protocol over LAN
      (EAPOL) and Link Layer Discovery Protocol (LLDP), link-local frames. They
      are not forwarded by a Bridge. Permanently configured entries in the
      filtering database (FDB) ensure that such frames are discarded by the
      Forwarding Process. In 8.6.3 of IEEE Std 802.1Q-2022, this is described in
      detail:
      
      Each of the reserved MAC addresses specified in Table 8-1
      (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]) shall be
      permanently configured in the FDB in C-VLAN components and ERs.
      
      Each of the reserved MAC addresses specified in Table 8-2
      (01-80-C2-00-00-[01,02,03,04,05,06,07,08,09,0A,0E]) shall be permanently
      configured in the FDB in S-VLAN components.
      
      Each of the reserved MAC addresses specified in Table 8-3
      (01-80-C2-00-00-[01,02,04,0E]) shall be permanently configured in the FDB
      in TPMR components.
      
      The FDB entries for reserved MAC addresses shall specify filtering for all
      Bridge Ports and all VIDs. Management shall not provide the capability to
      modify or remove entries for reserved MAC addresses.
      
      The addresses in Table 8-1, Table 8-2, and Table 8-3 determine the scope of
      propagation of PDUs within a Bridged Network, as follows:
      
        The Nearest Bridge group address (01-80-C2-00-00-0E) is an address that
        no conformant Two-Port MAC Relay (TPMR) component, Service VLAN (S-VLAN)
        component, Customer VLAN (C-VLAN) component, or MAC Bridge can forward.
        PDUs transmitted using this destination address, or any other addresses
        that appear in Table 8-1, Table 8-2, and Table 8-3
        (01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]), can
        therefore travel no further than those stations that can be reached via a
        single individual LAN from the originating station.
      
        The Nearest non-TPMR Bridge group address (01-80-C2-00-00-03), is an
        address that no conformant S-VLAN component, C-VLAN component, or MAC
        Bridge can forward; however, this address is relayed by a TPMR component.
        PDUs using this destination address, or any of the other addresses that
        appear in both Table 8-1 and Table 8-2 but not in Table 8-3
        (01-80-C2-00-00-[00,03,05,06,07,08,09,0A,0B,0C,0D,0F]), will be relayed
        by any TPMRs but will propagate no further than the nearest S-VLAN
        component, C-VLAN component, or MAC Bridge.
      
        The Nearest Customer Bridge group address (01-80-C2-00-00-00) is an
        address that no conformant C-VLAN component, MAC Bridge can forward;
        however, it is relayed by TPMR components and S-VLAN components. PDUs
        using this destination address, or any of the other addresses that appear
        in Table 8-1 but not in either Table 8-2 or Table 8-3
        (01-80-C2-00-00-[00,0B,0C,0D,0F]), will be relayed by TPMR components and
        S-VLAN components but will propagate no further than the nearest C-VLAN
        component or MAC Bridge.
      
      Because the LLC Entity associated with each Bridge Port is provided via CPU
      port, we must not filter these frames but forward them to CPU port.
      
      In a Bridge, the transmission Port is mainly decided by ingress and egress
      rules, FDB, and spanning tree Port State functions of the Forwarding
      Process. For link-local frames, only CPU port should be designated as
      destination port in the FDB, and the other functions of the Forwarding
      Process must not interfere with the decision of the transmission Port. We
      call this process trapping frames to CPU port.
      
      Therefore, on the switch with CPU port architecture, link-local frames must
      be trapped to CPU port, and certain link-local frames received by a Port of
      a Bridge comprising a TPMR component or an S-VLAN component must be
      excluded from it.
      
      A Bridge of the switch with CPU port architecture cannot comprise a
      Two-Port MAC Relay (TPMR) component as a TPMR component supports only a
      subset of the functionality of a MAC Bridge. A Bridge comprising two Ports
      (Management Port doesn't count) of this architecture will either function
      as a standard MAC Bridge or a standard VLAN Bridge.
      
      Therefore, a Bridge of this architecture can only comprise S-VLAN
      components, C-VLAN components, or MAC Bridge components. Since there's no
      TPMR component, we don't need to relay PDUs using the destination addresses
      specified on the Nearest non-TPMR section, and the portion of the
      Nearest Customer Bridge section where they must be relayed by TPMR
      components.
      
      One option to trap link-local frames to CPU port is to add static FDB
      entries with CPU port designated as destination port. However, because
      Independent VLAN Learning (IVL) is being used on every VID, each entry only
      applies to a single VLAN Identifier (VID). For a Bridge comprising a MAC
      Bridge component or a C-VLAN component, there would have to be 16 times
      4096 entries. This switch intellectual property can only hold a maximum of
      2048 entries. Using this option, there also isn't a mechanism to prevent
      link-local frames from being discarded when the spanning tree Port State of
      the reception Port is discarding.
      
      The remaining option is to utilise the BPC, RGAC1, RGAC2, RGAC3, and RGAC4
      registers. Whilst this applies to every VID, it doesn't contain all of the
      reserved MAC addresses without affecting the remaining Standard Group MAC
      Addresses. The REV_UN frame tag utilised using the RGAC4 register covers
      the remaining 01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F] destination
      addresses. It also includes the 01-80-C2-00-00-22 to 01-80-C2-00-00-FF
      destination addresses which may be relayed by MAC Bridges or VLAN Bridges.
      The latter option provides better but not complete conformance.
      
      This switch intellectual property also does not provide a mechanism to trap
      link-local frames with specific destination addresses to CPU port by
      Bridge, to conform to the filtering rules for the distinct Bridge
      components.
      
      Therefore, regardless of the type of the Bridge component, link-local
      frames with these destination addresses will be trapped to CPU port:
      
      01-80-C2-00-00-[00,01,02,03,0E]
      
      In a Bridge comprising a MAC Bridge component or a C-VLAN component:
      
        Link-local frames with these destination addresses won't be trapped to
        CPU port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F]
      
      In a Bridge comprising an S-VLAN component:
      
        Link-local frames with these destination addresses will be trapped to CPU
        port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-00
      
        Link-local frames with these destination addresses won't be trapped to
        CPU port which won't conform to IEEE Std 802.1Q-2022:
      
        01-80-C2-00-00-[04,05,06,07,08,09,0A]
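      
      The address sets above can be expressed as two small predicates. This
      is an illustrative userspace sketch with hypothetical helper names; it
      only classifies destination addresses and does not model the switch
      registers.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static const uint8_t reserved_prefix[5] = { 0x01, 0x80, 0xC2, 0x00, 0x00 };

/* True for 01-80-C2-00-00-00 through 01-80-C2-00-00-0F. */
static bool is_reserved_group_addr(const uint8_t mac[6])
{
	return memcmp(mac, reserved_prefix, sizeof(reserved_prefix)) == 0 &&
	       mac[5] <= 0x0F;
}

/* True for the addresses listed as trapped: 00, 01, 02, 03 and 0E. */
static bool is_trapped_to_cpu(const uint8_t mac[6])
{
	if (!is_reserved_group_addr(mac))
		return false;
	return mac[5] <= 0x03 || mac[5] == 0x0E;
}
```

      An address such as 01-80-C2-00-00-04 is reserved but not in the trapped
      set, which is the non-conformance described above.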
      
      Currently on this switch intellectual property, if the spanning tree Port
      State of the reception Port is discarding, link-local frames will be
      discarded.
      
      To trap link-local frames regardless of the spanning tree Port State, make
      the switch regard them as Bridge Protocol Data Units (BPDUs). This switch
      intellectual property only lets the frames regarded as BPDUs bypass the
      spanning tree Port State function of the Forwarding Process.
      
      With this change, the only remaining interference is the ingress rules.
      When the reception Port has no PVID assigned on software, VLAN-untagged
      frames won't be allowed in. There doesn't seem to be a mechanism on the
      switch intellectual property to have link-local frames bypass this function
      of the Forwarding Process.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Reviewed-by: Daniel Golle <daniel@makrotopia.org>
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Link: https://lore.kernel.org/r/20240409-b4-for-net-mt7530-fix-link-local-when-stp-discarding-v2-1-07b1150164ac@arinc9.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      17c56011
    • Gerd Bayer's avatar
      Revert "s390/ism: fix receive message buffer allocation" · d51dc8dd
      Gerd Bayer authored
      This reverts commit 58effa34.
      Review was not finished on this patch. So it's not ready for
      upstreaming.
      Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
      Link: https://lore.kernel.org/r/20240409113753.2181368-1-gbayer@linux.ibm.com
      Fixes: 58effa34 ("s390/ism: fix receive message buffer allocation")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      d51dc8dd
    • Daniel Machon's avatar
      net: sparx5: fix wrong config being used when reconfiguring PCS · 33623113
      Daniel Machon authored
      The wrong port config is being used if the PCS is reconfigured. Fix this
      by correctly using the new config instead of the old one.
      
      Fixes: 946e7fd5 ("net: sparx5: add port module support")
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409-link-mode-reconfiguration-fix-v2-1-db6a507f3627@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      33623113
    • Arnd Bergmann's avatar
      net/mlx5: fix possible stack overflows · fe87922c
      Arnd Bergmann authored
      A couple of debug functions use a 512 byte temporary buffer and call another
      function that has another buffer of the same size, which in turn exceeds the
      usual warning limit for excessive stack usage:
      
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:1073:1: error: stack frame size (1448) exceeds limit (1024) in 'dr_dump_start' [-Werror,-Wframe-larger-than]
      dr_dump_start(struct seq_file *file, loff_t *pos)
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:1009:1: error: stack frame size (1120) exceeds limit (1024) in 'dr_dump_domain' [-Werror,-Wframe-larger-than]
      dr_dump_domain(struct seq_file *file, struct mlx5dr_domain *dmn)
      drivers/net/ethernet/mellanox/mlx5/core/steering/dr_dbg.c:705:1: error: stack frame size (1104) exceeds limit (1024) in 'dr_dump_matcher_rx_tx' [-Werror,-Wframe-larger-than]
      dr_dump_matcher_rx_tx(struct seq_file *file, bool is_rx,
      
      Rework these so that each of the various code paths only ever has one of
      these buffers in it, and exactly the functions that declare one have
      the 'noinline_for_stack' annotation that prevents them from all being
      inlined into the same caller.
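      
      A userspace sketch of the resulting structure, assuming GCC or Clang.
      The macro below is a stand-in for the kernel annotation and the dump
      helpers are hypothetical. Each helper owns one 512 byte buffer, the
      caller owns none, and the helpers are called one after the other
      rather than nested.

```c
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for the kernel annotation of the same name. */
#define noinline_for_stack __attribute__((noinline))

#define DUMP_BUF_LEN 512

static noinline_for_stack int dump_table(char *out, size_t len, int id)
{
	char buf[DUMP_BUF_LEN];	/* lives only for this call */

	snprintf(buf, sizeof(buf), "table %d\n", id);
	return snprintf(out, len, "%s", buf);
}

static noinline_for_stack int dump_rule(char *out, size_t len, int id)
{
	char buf[DUMP_BUF_LEN];	/* lives only for this call */

	snprintf(buf, sizeof(buf), "rule %d\n", id);
	return snprintf(out, len, "%s", buf);
}

/* The caller declares no buffer; only one helper frame exists at a time. */
static int dump_all(char *out, size_t len)
{
	int n, m;

	n = dump_table(out, len, 1);
	if (n < 0 || (size_t)n >= len)
		return -1;
	m = dump_rule(out + n, len - (size_t)n, 2);
	if (m < 0 || (size_t)m >= len - (size_t)n)
		return -1;
	return n + m;
}
```

      If the compiler were free to inline both helpers, their buffers could
      end up in dump_all()'s frame together, which is what the annotation
      prevents.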
      
      Fixes: 917d1e79 ("net/mlx5: DR, Change SWS usage to debug fs seq_file interface")
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/all/20240219100506.648089-1-arnd@kernel.org/
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240408074142.3007036-1-arnd@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fe87922c
    • Jakub Kicinski's avatar
      Merge branch 'bnxt_en-updates-for-net-next' · 872c00cc
      Jakub Kicinski authored
      Michael Chan says:
      
      ====================
      bnxt_en: Updates for net-next
      
      The first patch prevents a driver crash when RSS contexts are
      configured in ifdown state.  Patches 2 to 6 are improvements for
      managing MSIX for the aux device (for RoCE).  The existing
      scheme statically carves out the MSIX vectors for RoCE even if
      the RoCE driver is not loaded.  The new scheme adds flexibility
      and allows the L2 driver to use the RoCE MSIX vectors if needed
      when they are unused by the RoCE driver.  The last patch updates
      the MODULE_DESCRIPTION().
      ====================
      
      Link: https://lore.kernel.org/r/20240409215431.41424-1-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Michael Chan's avatar
      bnxt_en: Update MODULE_DESCRIPTION · 008ce0fd
      Michael Chan authored
      Update MODULE_DESCRIPTION to the more generic adapter family name.
      The old name only includes the first generation of supported
      adapters.

      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409215431.41424-8-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Vikas Gupta's avatar
      bnxt_en: Utilize ulp client resources if RoCE is not registered · d630624e
      Vikas Gupta authored
      If the RoCE driver is not registered for a RoCE capable device, add
      flexibility to use the RoCE resources (MSIX/NQs) for L2 purposes,
      such as additional rings configured by the user or for XDP.

      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409215431.41424-7-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Vikas Gupta's avatar
      bnxt_en: Change MSIX/NQs allocation policy · 2e4592dc
      Vikas Gupta authored
      The existing scheme sets aside a number of MSIX/NQs for the RoCE
      driver whether the RoCE driver is registered or not.  This scheme
      is not flexible and limits the resources available for the L2 rings
      if RoCE is never used.
      
      Modify the scheme so that the RoCE MSIX/NQs can be used by the L2
      driver if they are not used for RoCE.  The MSIX/NQs are now
      represented by 3 fields.  bp->ulp_num_msix_want contains the
      desired default value, edev->ulp_num_msix_vec contains the
      available value (but not necessarily in use), and
      ulp_tbl->msix_requested contains the actual value in use by RoCE.
      
      The L2 driver can dip into edev->ulp_num_msix_vec if necessary.
      
      We need to add rtnl_lock() back in bnxt_register_dev() and
      bnxt_unregister_dev() to synchronize the MSIX usage between L2 and
      RoCE.

      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409215431.41424-6-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
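The three-counter scheme in the message above can be modelled with a toy example. This is a sketch under stated assumptions, not the bnxt_en implementation: `struct msix_model`, its field names and `l2_borrow()` are hypothetical, and the rule that L2 may take only vectors that are available to RoCE but not in use is inferred from the commit text.

```c
#include <assert.h>

/* Hypothetical model of the three values named in the commit message. */
struct msix_model {
	int ulp_want;	/* desired default (cf. bp->ulp_num_msix_want) */
	int ulp_avail;	/* available to RoCE, not necessarily in use
			 * (cf. edev->ulp_num_msix_vec) */
	int ulp_in_use;	/* actually used by RoCE
			 * (cf. ulp_tbl->msix_requested) */
	int l2_vecs;	/* vectors currently held by the L2 driver */
};

/*
 * L2 asks for 'need' extra vectors and may dip into the RoCE pool only
 * for vectors RoCE is not using. Returns the number actually granted.
 */
static int l2_borrow(struct msix_model *m, int need)
{
	int spare = m->ulp_avail - m->ulp_in_use;
	int got = need < spare ? need : spare;

	if (got < 0)
		got = 0;
	m->ulp_avail -= got;
	m->l2_vecs += got;
	return got;
}
```

In this model the desired default in `ulp_want` is never modified by borrowing, so the original carve-out could be restored later. In the driver, the commit message says the equivalent bookkeeping is serialized with rtnl_lock(); the sketch omits locking.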
    • Vikas Gupta's avatar
      bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions · 194fad5b
      Vikas Gupta authored
      In its current form, bnxt_rdma_aux_device_init() not only initializes
      the necessary data structures of the newly created aux device but also
      adds the aux device to the aux bus subsystem. Refactor the logic into
      two functions: the first initializes the aux device along with the
      required resources, and the second adds the device to the aux bus
      subsystem.
      This separation helps to create bnxt_en_dev much earlier and save its
      resources separately.

      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409215431.41424-5-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Vikas Gupta's avatar
      bnxt_en: Remove unneeded MSIX base structure fields and code · b58f5a9c
      Vikas Gupta authored
      Ever since commit:
      
      30343221 ("bnxt_en: Remove runtime interrupt vector allocation")
      
      the MSIX base vector is effectively always 0.  Remove all unneeded
      structure fields and code referencing the MSIX base.

      Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409215431.41424-4-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Kalesh AP's avatar
      bnxt_en: Remove a redundant NULL check in bnxt_register_dev() · 43226dcc
      Kalesh AP authored
      The memory for "edev->ulp_tbl" is allocated inside the
      bnxt_rdma_aux_device_init() function. If it fails, the driver
      will not create the auxiliary device for RoCE. Hence the NULL
      check inside bnxt_register_dev() is unnecessary.

      Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
      Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
      Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240409215431.41424-3-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>