1. 22 Aug, 2024 22 commits
  2. 21 Aug, 2024 2 commits
  3. 20 Aug, 2024 16 commits
    • l2tp: use skb_queue_purge in l2tp_ip_destroy_sock · bc3dd9ed
      James Chapman authored
      Recent commit ed8ebee6 ("l2tp: have l2tp_ip_destroy_sock use
      ip_flush_pending_frames") was incorrect in that l2tp_ip does not use
      socket cork and ip_flush_pending_frames is for sockets that do. Use
      __skb_queue_purge instead and remove the unnecessary lock.
      
      Also unexport ip_flush_pending_frames since it was originally exported
      in commit 4ff88634 ("ipv4: export ip_flush_pending_frames") for
      l2tp and is not used by other modules.
      
      Suggested-by: xiyou.wangcong@gmail.com
      Signed-off-by: James Chapman <jchapman@katalix.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://patch.msgid.link/20240819143333.3204957-1-jchapman@katalix.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Don't call skb_get() for OOB skb. · 8594d9b8
      Kuniyuki Iwashima authored
      Since its introduction, the OOB skb has held an additional reference
      count for no particular reason, which has caused many issues.
      
      Also, kfree_skb() and consume_skb() are used to decrement the count,
      which is confusing.
      
      Let's drop the unnecessary skb_get() in queue_oob() and corresponding
      kfree_skb(), consume_skb(), and skb_unref().
      
      Now unix_sk(sk)->oob_skb is just a pointer to skb in the receive queue,
      so special handling is no longer needed in GC.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://patch.msgid.link/20240816233921.57800-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: socionext,uniphier-ave4: add top-level constraints · 2862c934
      Krzysztof Kozlowski authored
      Properties with a variable number of items per device are expected to
      have the widest constraints in the top-level "properties:" block, further
      customized (narrowed) in "if:then:".  Add the missing top-level
      constraints for clock-names and reset-names.
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
      Link: https://patch.msgid.link/20240818172905.121829-4-krzysztof.kozlowski@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: renesas,etheravb: add top-level constraints · 70d16e13
      Krzysztof Kozlowski authored
      Properties with a variable number of items per device are expected to
      have the widest constraints in the top-level "properties:" block, further
      customized (narrowed) in "if:then:".  Add the missing top-level
      constraints for reg, clocks, clock-names, interrupts and interrupt-names.
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
      Link: https://patch.msgid.link/20240818172905.121829-3-krzysztof.kozlowski@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: mediatek,net: add top-level constraints · 06ab21c3
      Krzysztof Kozlowski authored
      Properties with a variable number of items per device are expected to
      have the widest constraints in the top-level "properties:" block, further
      customized (narrowed) in "if:then:".  Add the missing top-level
      constraints for clocks and clock-names.
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
      Link: https://patch.msgid.link/20240818172905.121829-2-krzysztof.kozlowski@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: mediatek,net: narrow interrupts per variants · 55da77de
      Krzysztof Kozlowski authored
      Each variable-length property, such as interrupts, must have fixed
      constraints on the number of items for a given variant in the binding.
      The clauses in the "if:then:" block should define both limits: upper and
      lower.
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
      Link: https://patch.msgid.link/20240818172905.121829-1-krzysztof.kozlowski@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Silence false field-spanning write warning in metadata_dst memcpy · 13cfd6a6
      Gal Pressman authored
      When metadata_dst struct is allocated (using metadata_dst_alloc()), it
      reserves room for options at the end of the struct.
      
      Change the memcpy() to unsafe_memcpy() as it is guaranteed that enough
      room (md_size bytes) was allocated and the field-spanning write is
      intentional.
      
      This resolves the following warning:
      	------------[ cut here ]------------
      	memcpy: detected field-spanning write (size 104) of single field "&new_md->u.tun_info" at include/net/dst_metadata.h:166 (size 96)
      	WARNING: CPU: 2 PID: 391470 at include/net/dst_metadata.h:166 tun_dst_unclone+0x114/0x138 [geneve]
      	Modules linked in: act_tunnel_key geneve ip6_udp_tunnel udp_tunnel act_vlan act_mirred act_skbedit cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress sbsa_gwdt ipmi_devintf ipmi_msghandler xfrm_interface xfrm6_tunnel tunnel6 tunnel4 xfrm_user xfrm_algo nvme_fabrics overlay optee openvswitch nsh nf_conncount ib_srp scsi_transport_srp rpcrdma rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm uio_pdrv_genirq uio mlxbf_pmc pwr_mlxbf mlxbf_bootctl bluefield_edac nft_chain_nat binfmt_misc xt_MASQUERADE nf_nat xt_tcpmss xt_NFLOG nfnetlink_log xt_recent xt_hashlimit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_comment ipt_REJECT nf_reject_ipv4 nft_compat nf_tables nfnetlink sch_fq_codel dm_multipath fuse efi_pstore ip_tables btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq raid1 raid0 nvme nvme_core mlx5_ib ib_uverbs ib_core ipv6 crc_ccitt mlx5_core crct10dif_ce mlxfw
      	 psample i2c_mlxbf gpio_mlxbf2 mlxbf_gige mlxbf_tmfifo
      	CPU: 2 PID: 391470 Comm: handler6 Not tainted 6.10.0-rc1 #1
      	Hardware name: https://www.mellanox.com BlueField SoC/BlueField SoC, BIOS 4.5.0.12993 Dec  6 2023
      	pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      	pc : tun_dst_unclone+0x114/0x138 [geneve]
      	lr : tun_dst_unclone+0x114/0x138 [geneve]
      	sp : ffffffc0804533f0
      	x29: ffffffc0804533f0 x28: 000000000000024e x27: 0000000000000000
      	x26: ffffffdcfc0e8e40 x25: ffffff8086fa6600 x24: ffffff8096a0c000
      	x23: 0000000000000068 x22: 0000000000000008 x21: ffffff8092ad7000
      	x20: ffffff8081e17900 x19: ffffff8092ad7900 x18: 00000000fffffffd
      	x17: 0000000000000000 x16: ffffffdcfa018488 x15: 695f6e75742e753e
      	x14: 2d646d5f77656e26 x13: 6d5f77656e262220 x12: 646c65696620656c
      	x11: ffffffdcfbe33ae8 x10: ffffffdcfbe1baa8 x9 : ffffffdcfa0a4c10
      	x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
      	x5 : ffffff83fdeeb010 x4 : 0000000000000000 x3 : 0000000000000027
      	x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80913f6780
      	Call trace:
      	 tun_dst_unclone+0x114/0x138 [geneve]
      	 geneve_xmit+0x214/0x10e0 [geneve]
      	 dev_hard_start_xmit+0xc0/0x220
      	 __dev_queue_xmit+0xa14/0xd38
      	 dev_queue_xmit+0x14/0x28 [openvswitch]
      	 ovs_vport_send+0x98/0x1c8 [openvswitch]
      	 do_output+0x80/0x1a0 [openvswitch]
      	 do_execute_actions+0x172c/0x1958 [openvswitch]
      	 ovs_execute_actions+0x64/0x1a8 [openvswitch]
      	 ovs_packet_cmd_execute+0x258/0x2d8 [openvswitch]
      	 genl_family_rcv_msg_doit+0xc8/0x138
      	 genl_rcv_msg+0x1ec/0x280
      	 netlink_rcv_skb+0x64/0x150
      	 genl_rcv+0x40/0x60
      	 netlink_unicast+0x2e4/0x348
      	 netlink_sendmsg+0x1b0/0x400
      	 __sock_sendmsg+0x64/0xc0
      	 ____sys_sendmsg+0x284/0x308
      	 ___sys_sendmsg+0x88/0xf0
      	 __sys_sendmsg+0x70/0xd8
      	 __arm64_sys_sendmsg+0x2c/0x40
      	 invoke_syscall+0x50/0x128
      	 el0_svc_common.constprop.0+0x48/0xf0
      	 do_el0_svc+0x24/0x38
      	 el0_svc+0x38/0x100
      	 el0t_64_sync_handler+0xc0/0xc8
      	 el0t_64_sync+0x1a4/0x1a8
      	---[ end trace 0000000000000000 ]---
      Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Gal Pressman <gal@nvidia.com>
      Link: https://patch.msgid.link/20240818114351.3612692-1-gal@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
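The layout described in the metadata_dst commit, a struct allocated with extra room for options at its end, can be illustrated with a minimal user-space sketch. All names here (`md_sketch`, `tun_info_sketch`, `md_alloc_sketch`, `md_opts_sketch`, `md_clone_sketch`) are hypothetical stand-ins, not the kernel's definitions. Unlike the kernel code, which copies the tunnel info and its trailing options in one intentionally field-spanning copy, this sketch copies the header and the options separately.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a tunnel-info header followed by options. */
struct tun_info_sketch {
    uint32_t key;
    uint8_t options_len;    /* bytes of options stored after the struct */
};

/* Hypothetical stand-in for a metadata object; options live past its end. */
struct md_sketch {
    int refcnt;
    struct tun_info_sketch tun_info;
};

/* Allocate the struct plus optslen trailing bytes, zero-initialized. */
static struct md_sketch *md_alloc_sketch(uint8_t optslen)
{
    struct md_sketch *md = calloc(1, sizeof(*md) + optslen);

    if (!md)
        return NULL;
    md->refcnt = 1;
    md->tun_info.options_len = optslen;
    return md;
}

/* The options start right after the struct, inside the same allocation. */
static unsigned char *md_opts_sketch(struct md_sketch *md)
{
    return (unsigned char *)md + sizeof(*md);
}

/* Clone: the destination is sized for the options, so both copies fit. */
static struct md_sketch *md_clone_sketch(struct md_sketch *src)
{
    struct md_sketch *dst = md_alloc_sketch(src->tun_info.options_len);

    if (!dst)
        return NULL;
    dst->tun_info = src->tun_info;                  /* fixed-size header */
    memcpy(md_opts_sketch(dst), md_opts_sketch(src),
           src->tun_info.options_len);              /* trailing options */
    return dst;
}
```

Because the allocation size is `sizeof(struct) + optslen`, a copy that runs past the last named field still stays inside the allocation, which is why the commit treats the reported write as a false positive.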
    • net: hns3: Use ARRAY_SIZE() to improve readability · 2cbece60
      Zhang Zekun authored
      There is a helper, ARRAY_SIZE(), for calculating the size of a u32
      array, so we don't need to do it manually. Use ARRAY_SIZE() to
      calculate the array size and improve code readability.
      Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
      Reviewed-by: Jijie Shao <shaojijie@huawei.com>
      Link: https://patch.msgid.link/20240818052518.45489-1-zhangzekun11@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
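The difference between a hand-written element count and the helper can be sketched in user-space C. `ARRAY_SIZE_SKETCH` is a simplified stand-in for the kernel's ARRAY_SIZE() (the kernel macro additionally rejects non-array arguments at build time), and `reg_offsets` is a hypothetical table, not one taken from the hns3 driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's ARRAY_SIZE(). */
#define ARRAY_SIZE_SKETCH(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical register table, for illustration only. */
static const uint32_t reg_offsets[] = { 0x00, 0x04, 0x08, 0x10, 0x20 };

/* Manual form: repeats the element type, which can drift from the array. */
static size_t count_manual(void)
{
    return sizeof(reg_offsets) / sizeof(uint32_t);
}

/* Helper form: derives the element size from the array itself. */
static size_t count_with_helper(void)
{
    return ARRAY_SIZE_SKETCH(reg_offsets);
}
```

Both functions return the same count here; the helper form stays correct if the element type of the table is later changed.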
    • selftests: net/forwarding: spawn sh inside vrf to speed up ping loop · 555e5531
      Jakub Kicinski authored
      Looking at timestamped output of netdev CI reveals that
      most of the time in forwarding tests for custom route
      hashing is spent on a single case, namely the test which
      uses ping (mausezahn does not support flow labels).
      
      On a non-debug kernel we spend 714 of 730 total test
      runtime (97%) on this test case. While having flow label
      support in a traffic gen tool / mausezahn would be best,
      we can significantly speed up the loop by putting ip vrf exec
      outside of the iteration.
      
      In a test of 1000 pings, a normal loop takes 50 seconds
      to finish, while using:
      
        ip vrf exec $vrf sh -c "$loop-body"
      
      takes 12 seconds (1/4 of the time).
      
      Some of the slowness is likely due to our inefficient virtualization
      setup, but even on my laptop running "ip link help" 16k times takes
      25-30 seconds, so I think it's worth optimizing even for fastest
      setups.
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Link: https://patch.msgid.link/20240817203659.712085-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethernet: ibm: Simpify code with for_each_child_of_node() · 79765386
      Zhang Zekun authored
      for_each_child_of_node() iterates over the children of a device_node,
      so an open-coded while loop is not needed. No functional change with
      this conversion.
      Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20240816015837.109627-1-zhangzekun11@huawei.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
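The kind of conversion described above can be sketched with a user-space analogue. `node_sketch`, `next_child_sketch` and `for_each_child_sketch` are hypothetical names; the kernel's for_each_child_of_node() is built on of_get_next_child(), which also manages node reference counts, and this sketch leaves that out.

```c
#include <stddef.h>

/* Hypothetical tree node: first child plus a sibling chain. */
struct node_sketch {
    const char *name;
    struct node_sketch *child;
    struct node_sketch *sibling;
};

/* Return the child after prev, or the first child when prev is NULL. */
static struct node_sketch *next_child_sketch(const struct node_sketch *parent,
                                             struct node_sketch *prev)
{
    return prev ? prev->sibling : parent->child;
}

/* Iterator macro in the style of for_each_child_of_node(). */
#define for_each_child_sketch(parent, c)                      \
    for ((c) = next_child_sketch((parent), NULL); (c) != NULL; \
         (c) = next_child_sketch((parent), (c)))

/* Before: open-coded while loop. */
static int count_children_while(const struct node_sketch *parent)
{
    struct node_sketch *c = NULL;
    int n = 0;

    while ((c = next_child_sketch(parent, c)) != NULL)
        n++;
    return n;
}

/* After: the same walk expressed with the iterator macro. */
static int count_children_for_each(const struct node_sketch *parent)
{
    struct node_sketch *c;
    int n = 0;

    for_each_child_sketch(parent, c)
        n++;
    return n;
}
```

Both functions visit the same children in the same order, which matches the commit's statement that the conversion is not a functional change.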
    • Merge branch 'preparations-for-fib-rule-dscp-selector' · 6b2efdc4
      Paolo Abeni authored
      Ido Schimmel says:
      
      ====================
      Preparations for FIB rule DSCP selector
      
      This patchset moves the masking of the upper DSCP bits in 'flowi4_tos'
      to the core instead of relying on callers of the FIB lookup API to do
      it.
      
      This will allow us to start changing users of the API to initialize the
      'flowi4_tos' field with all six bits of the DSCP field. In turn, this
      will allow us to extend FIB rules with a new DSCP selector.
      
      By masking the upper DSCP bits in the core we are able to maintain the
      behavior of the TOS selector in FIB rules and routes to only match on
      the lower DSCP bits.
      
      While working on this I found two users of the API that do not mask the
      upper DSCP bits before performing the lookup. The first is an ancient
      netlink family that is unlikely to be used. It is adjusted in patch #1
      to mask both the upper DSCP bits and the ECN bits before calling the
      API.
      
      The second user is a nftables module that differs in this regard from
      its equivalent iptables module. It is adjusted in patch #2 to invoke the
      API with the upper DSCP bits masked, like all other callers. The
      relevant selftest passed, but in the unlikely case that regressions are
      reported because of this change, we can restore the existing behavior
      using a new flow information flag as discussed here [1].
      
      The last patch moves the masking of the upper DSCP bits to the core,
      making the first two patches redundant, but I wanted to post them
      separately to call attention to the behavior change for these two users
      of the FIB lookup API.
      
      Future patchsets (around 3) will start unmasking the upper DSCP bits
      throughout the networking stack before adding support for the new FIB
      rule DSCP selector.
      
      Changes from v1 [2]:
      
      Patch #3: Include <linux/ip.h> in <linux/in_route.h> instead of
      including it in net/ip_fib.h
      
      [1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/
      [2] https://lore.kernel.org/netdev/20240725131729.1729103-1-idosch@nvidia.com/
      ====================
      
      Link: https://patch.msgid.link/20240814125224.972815-1-idosch@nvidia.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • ipv4: Centralize TOS matching · 1fa3314c
      Ido Schimmel authored
      The TOS field in the IPv4 flow information structure ('flowi4_tos') is
      matched by the kernel against the TOS selector in IPv4 rules and routes.
      The field is initialized differently by different call sites. Some treat
      it as DSCP (RFC 2474) and initialize all six DSCP bits, some treat it as
      RFC 1349 TOS and initialize it using RT_TOS() and some treat it as RFC
      791 TOS and initialize it using IPTOS_RT_MASK.
      
      What is common to all these call sites is that they all initialize the
      lower three DSCP bits, which fits the TOS definition in the initial IPv4
      specification (RFC 791).
      
      Therefore, the kernel only allows configuring IPv4 FIB rules that match
      on the lower three DSCP bits which are always guaranteed to be
      initialized by all call sites:
      
       # ip -4 rule add tos 0x1c table 100
       # ip -4 rule add tos 0x3c table 100
       Error: Invalid tos.
      
      While this works, it is unlikely to be very useful. RFC 791, which
      initially defined the TOS and IP precedence fields, was updated by RFC
      2474 over twenty-five years ago, when these fields were replaced by a
      single six-bit DSCP field.
      
      Extending FIB rules to match on DSCP can be done by adding a new DSCP
      selector while maintaining the existing semantics of the TOS selector
      for applications that rely on that.
      
      A prerequisite for allowing FIB rules to match on DSCP is to adjust all
      the call sites to initialize the high order DSCP bits and remove their
      masking along the path to the core where the field is matched on.
      
      However, making this change alone will result in a behavior change. For
      example, a forwarded IPv4 packet with a DS field of 0xfc will no longer
      match a FIB rule that was configured with 'tos 0x1c'.
      
      This behavior change can be avoided by masking the upper three DSCP bits
      in 'flowi4_tos' before comparing it against the TOS selectors in FIB
      rules and routes.
      
      Implement the above by adding a new function that checks whether a given
      DSCP value matches the one specified in the IPv4 flow information
      structure and invoke it from the three places that currently match on
      'flowi4_tos'.
      
      Use RT_TOS() for the masking of 'flowi4_tos' instead of IPTOS_RT_MASK
      since the latter is not uAPI and we should be able to remove it at some
      point.
      
      Include <linux/ip.h> in <linux/in_route.h> since the former defines
      IPTOS_TOS_MASK which is used in the definition of RT_TOS() in
      <linux/in_route.h>.
      
      No regressions in FIB tests:
      
       # ./fib_tests.sh
       [...]
       Tests passed: 218
       Tests failed:   0
      
      And FIB rule tests:
      
       # ./fib_rule_tests.sh
       [...]
       Tests passed: 116
       Tests failed:   0
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
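The masked comparison described in this commit can be sketched in user-space C. The function and macro names are hypothetical; the sketch assumes the mask value 0x1E (the value of IPTOS_TOS_MASK, which RT_TOS() applies) and a DSCP mask of 0xFC for the upper six bits of the DS field.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOS_MASK_SKETCH  0x1E   /* assumed: same value as IPTOS_TOS_MASK */
#define DSCP_MASK_SKETCH 0xFC   /* assumed: upper six bits of the DS field */

/* Stand-in for RT_TOS(): drops the three upper DSCP bits. */
static uint8_t rt_tos_sketch(uint8_t tos)
{
    return tos & TOS_MASK_SKETCH;
}

/*
 * Hypothetical helper: compare a configured TOS selector against the
 * flow's TOS after masking the upper DSCP bits, then the ECN bits.
 */
static bool dscp_masked_match_sketch(uint8_t selector, uint8_t flow_tos)
{
    return selector == (rt_tos_sketch(flow_tos) & DSCP_MASK_SKETCH);
}
```

With this masking, a flow whose DS field is 0xfc still matches a selector of 0x1c, which is the behavior the commit message says must be preserved once callers start passing all six DSCP bits.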
    • netfilter: nft_fib: Mask upper DSCP bits before FIB lookup · 548a2029
      Ido Schimmel authored
      As part of its functionality, the nftables FIB expression module
      performs a FIB lookup, but unlike other users of the FIB lookup API, it
      does so without masking the upper DSCP bits. In particular, this differs
      from the equivalent iptables match ("rpfilter") that does mask the upper
      DSCP bits before the FIB lookup.
      
      Align the module to other users of the FIB lookup API and mask the upper
      DSCP bits using IPTOS_RT_MASK before the lookup.
      
      No regressions in nft_fib.sh:
      
       # ./nft_fib.sh
       PASS: fib expression did not cause unwanted packet drops
       PASS: fib expression did drop packets for 1.1.1.1
       PASS: fib expression did drop packets for 1c3::c01d
       PASS: fib expression forward check with policy based routing
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • ipv4: Mask upper DSCP bits and ECN bits in NETLINK_FIB_LOOKUP family · 8fed5475
      Ido Schimmel authored
      The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB
      lookup according to user provided parameters and communicate the result
      back to user space.
      
      However, unlike other users of the FIB lookup API, the upper DSCP bits
      and the ECN bits of the DS field are not masked, which can result in the
      wrong result being returned.
      
      Solve this by masking the upper DSCP bits and the ECN bits using
      IPTOS_RT_MASK.
      
      The structure that communicates the request and the response is not
      exported to user space, so it is unlikely that this netlink family is
      actually in use [1].
      
      [1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'net-smc-introduce-ringbufs-usage-statistics' · ccb445ae
      Paolo Abeni authored
      Wen Gu says:
      
      ====================
      net/smc: introduce ringbufs usage statistics
      
      Currently, we have histograms that show the sizes of ringbufs that were
      ever used by SMC connections. However, they only ever grow, and since
      SMC allows the reuse of ringbufs, we cannot know the actual amount of
      ringbufs being allocated or actively used.
      
      So this patch set introduces statistics for the amount of ringbufs that
      are actually allocated by link groups and actively used by connections
      of a certain net namespace, so that we can react based on this memory
      usage information, e.g. active fallback to TCP.
      
      With appropriate adaptations of smc-tools, we can obtain this ringbuf
      usage information:
      
      $ smcr -d linkgroup
      LG-ID    : 00000500
      LG-Role  : SERV
      LG-Type  : ASYML
      VLAN     : 0
      PNET-ID  :
      Version  : 1
      Conns    : 0
      Sndbuf   : 12910592 B    <-
      RMB      : 12910592 B    <-
      
      or
      
      $ smcr -d stats
      [...]
      RX Stats
        Data transmitted (Bytes)      869225943 (869.2M)
        Total requests                 18494479
        Buffer usage  (Bytes)          12910592 (12.31M)  <-
        [...]
      
      TX Stats
        Data transmitted (Bytes)    12760884405 (12.76G)
        Total requests                 36988338
        Buffer usage  (Bytes)          12910592 (12.31M)  <-
        [...]
      [...]
      
      Change log:
      v3->v2
      - use new helper nla_put_uint() instead of nla_put_u64_64bit().
      
      v2->v1
      https://lore.kernel.org/r/20240807075939.57882-1-guwen@linux.alibaba.com/
      - remove inline keyword in .c files.
      - use local variable in macros to avoid potential side effects.
      
      v1
      https://lore.kernel.org/r/20240805090551.80786-1-guwen@linux.alibaba.com/
      ====================
      
      Link: https://patch.msgid.link/20240814130827.73321-1-guwen@linux.alibaba.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net/smc: introduce statistics for ringbufs usage of net namespace · e0d10354
      Wen Gu authored
      The buffer size histograms in smc_stats, namely rx/tx_rmbsize, record
      the sizes of ringbufs for all connections that have ever appeared in
      the net namespace. They are incremental, so we cannot know the actual
      ringbuf usage from them. So introduce statistics in smc_stats for the
      current ringbuf usage of existing SMC connections in the net
      namespace. They are incremented when a new connection uses a ringbuf
      and decremented when the ringbuf is no longer used.
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
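The contrast between an incremental histogram and a current-usage statistic can be sketched as follows. The struct and function names are hypothetical and do not reflect the layout of smc_stats; the sketch only assumes a counter that is increased when a ringbuf is taken into use and decreased when it is released.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-namespace statistics, for illustration only. */
struct ringbuf_stats_sketch {
    uint64_t buffers_ever_used;     /* incremental: never decreases */
    _Atomic int64_t bytes_in_use;   /* current usage: goes up and down */
};

/* A connection starts using a ringbuf of the given size. */
static void ringbuf_use_sketch(struct ringbuf_stats_sketch *s, int64_t size)
{
    s->buffers_ever_used++;
    atomic_fetch_add(&s->bytes_in_use, size);
}

/* The ringbuf is no longer used by any connection. */
static void ringbuf_unuse_sketch(struct ringbuf_stats_sketch *s, int64_t size)
{
    atomic_fetch_sub(&s->bytes_in_use, size);
}

/* Read the current usage in bytes. */
static int64_t ringbuf_current_sketch(struct ringbuf_stats_sketch *s)
{
    return atomic_load(&s->bytes_in_use);
}
```

After two buffers are taken into use and one is released, the incremental counter still reports two while the usage counter reports only the buffer that remains in use, which is the information the histograms alone cannot provide.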