Commits · 76abb5d675c49baa0bc899c6ef8b49197ac914b8 · Kirill Smelkov / linux

22 Aug, 2024 8 commits

net: xilinx: axienet: Add statistics support · 76abb5d6

Sean Anderson authored Aug 20, 2024

Add support for reading the statistics counters, if they are enabled.
The counters may be 64-bit, but we can't detect this statically as
there's no ability bit for it and the counters are read-only. Therefore,
we assume the counters are 32-bits by default. To ensure we don't miss
an overflow, we read all counters at 13-second intervals. This should be
often enough to ensure the bytes counters don't wrap at 2.5 Gbit/s.

Another complication is that the counters may be reset when the device
is reset (depending on configuration). To ensure the counters persist
across link up/down (including suspend/resume), we maintain our own
versions along with the last counter value we saw. Because we might wait
up to 100 ms for the reset to complete, we use a mutex to protect
writing hw_stats. We can't sleep in ndo_get_stats64, so we use a seqlock
to protect readers.

We don't bother disabling the refresh work when we detect 64-bit
counters. This is because the reset issue requires us to read
hw_stat_base and reset_in_progress anyway, which would still require the
seqcount. And I don't think skipping the task is worth the extra
bookkeeping.

We can't use the byte counters for either get_stats64 or
get_eth_mac_stats. This is because the byte counters include everything
in the frame (destination address to FCS, inclusive). But
rtnl_link_stats64 wants bytes excluding the FCS, and
ethtool_eth_mac_stats wants to exclude the L2 overhead (addresses and
length/type). It might be possible to calculate the byte values Linux
expects based on the frame counters, but I think it is simpler to use
the existing software counters.

get_ethtool_stats is implemented for nonstandard statistics. This
includes the aforementioned byte counters, VLAN and PFC frame
counters, and user-defined (e.g. with custom RTL) counters.
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20240820175343.760389-3-sean.anderson@linux.devSigned-off-by: Jakub Kicinski <kuba@kernel.org>

76abb5d6

net: xilinx: axienet: Report RxRject as rx_dropped · d70e3788

Sean Anderson authored Aug 20, 2024

The Receive Frame Rejected interrupt is asserted whenever there was a
receive error (bad FCS, bad length, etc.) or whenever the frame was
dropped due to a mismatched address. So this is really a combination of
rx_otherhost_dropped, rx_length_errors, rx_frame_errors, and
rx_crc_errors. Mismatched addresses are common and aren't really errors
at all (much like how fragments are normal on half-duplex links). To
avoid confusion, report these events as rx_dropped. This better
reflects what's going on: the packet was received by the MAC but dropped
before being processed.
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Link: https://patch.msgid.link/20240820175343.760389-2-sean.anderson@linux.devSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d70e3788

net: repack struct netdev_queue · 74b1e94e

Jakub Kicinski authored Aug 20, 2024

Adding the NAPI pointer to struct netdev_queue made it grow into another
cacheline, even though there was 44 bytes of padding available.

The struct was historically grouped as follows:

    /* read-mostly stuff (align) */
    /* ... random control path fields ... */
    /* write-mostly stuff (align) */
    /* ... 40 byte hole ... */
    /* struct dql (align) */

It seems that people want to add control path fields after
the read only fields. struct dql looks pretty innocent
but it forces its own alignment and nothing indicates that
there is a lot of empty space above it.

Move dql above the xmit_lock. This shifts the empty space
to the end of the struct rather than in the middle of it.
Move two example fields there to set an example.
Hopefully people will now add new fields at the end of
the struct. A lot of the read-only stuff is also control
path-only, but if we move it all we'll have another hole
in the middle.

Before:
	/* size: 384, cachelines: 6, members: 16 */
	/* sum members: 284, holes: 3, sum holes: 100 */

After:
        /* size: 320, cachelines: 5, members: 16 */
        /* sum members: 284, holes: 1, sum holes: 8 */
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240820205119.1321322-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

74b1e94e

ice: Fix a 32bit bug · 0ce054f2

Dan Carpenter authored Aug 20, 2024

BIT() is unsigned long but ->pu.flg_msk and ->pu.flg_val are u64 type.
On 32 bit systems, unsigned long is a u32 and the mismatch between u32
and u64 will break things for the high 32 bits.

Fixes: 9a4c07aa ("ice: add parser execution main loop")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/ddc231a8-89c1-4ff4-8704-9198bcb41f8d@stanley.mountainSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0ce054f2

ipv6: remove redundant check · d35a3a8f

Xi Huang authored Aug 20, 2024

err varibale will be set everytime,like -ENOBUFS and in if (err < 0),
 when code gets into this path. This check will just slowdown
the execution and that's all.
Signed-off-by: Xi Huang <xuiagnh@gmail.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20240820115442.49366-1-xuiagnh@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d35a3a8f

net: dsa: sja1105: Simplify with scoped for each OF child loop · 2d86ecb6

Jinjie Ruan authored Aug 20, 2024

Use scoped for_each_available_child_of_node_scoped() when iterating over
device nodes to make code a bit simpler.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20240820075047.681223-1-ruanjinjie@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2d86ecb6

net: dsa: ocelot: Simplify with scoped for each OF child loop · e58c3f3d

Jinjie Ruan authored Aug 20, 2024

Use scoped for_each_available_child_of_node_scoped() when iterating over
device nodes to make code a bit simpler.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20240820074805.680674-1-ruanjinjie@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e58c3f3d

nfc: pn533: Avoid -Wflex-array-member-not-at-end warnings · 488d3464

Gustavo A. R. Silva authored Aug 19, 2024

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Remove unnecessary flex-array member `data[]`, and with this fix
the following warnings:

drivers/nfc/pn533/usb.c:268:38: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
drivers/nfc/pn533/usb.c:275:38: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ZsPw7+6vNoS651Cb@elsantoSigned-off-by: Jakub Kicinski <kuba@kernel.org>

488d3464

21 Aug, 2024 2 commits

net: wwan: t7xx: PCIe reset rescan · d785ed94

Jinjian Song authored Aug 17, 2024

WWAN device is programmed to boot in normal mode or fastboot mode,
when triggering a device reset through ACPI call or fastboot switch
command. Maintain state machine synchronization and reprobe logic
after a device reset.

The PCIe device reset triggered by several ways.
E.g.:
 - fastboot: echo "fastboot_switching" > /sys/bus/pci/devices/${bdf}/t7xx_mode.
 - reset: echo "reset" > /sys/bus/pci/devices/${bdf}/t7xx_mode.
 - IRQ: PCIe device request driver to reset itself by an interrupt request.

Use pci_reset_function() as a generic way to reset device, save and
restore the PCIe configuration before and after reset device to ensure
the reprobe process.

Suggestion from Bjorn:
Link: https://lore.kernel.org/all/20230127133034.GA1364550@bhelgaas/Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d785ed94

net: dsa: b53: Use dev_err_probe() · 4d36b2b1

Florian Fainelli authored Aug 19, 2024

Rather than print an error even when we get -EPROBE_DEFER, use
dev_err_probe() to filter out those messages.

Link: https://github.com/openwrt/openwrt/pull/11680Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240820004436.224603-1-florian.fainelli@broadcom.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4d36b2b1

20 Aug, 2024 23 commits

l2tp: use skb_queue_purge in l2tp_ip_destroy_sock · bc3dd9ed

James Chapman authored Aug 19, 2024

Recent commit ed8ebee6 ("l2tp: have l2tp_ip_destroy_sock use
ip_flush_pending_frames") was incorrect in that l2tp_ip does not use
socket cork and ip_flush_pending_frames is for sockets that do. Use
__skb_queue_purge instead and remove the unnecessary lock.

Also unexport ip_flush_pending_frames since it was originally exported
in commit 4ff88634 ("ipv4: export ip_flush_pending_frames") for
l2tp and is not used by other modules.

Suggested-by: xiyou.wangcong@gmail.com
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240819143333.3204957-1-jchapman@katalix.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bc3dd9ed

af_unix: Don't call skb_get() for OOB skb. · 8594d9b8

Kuniyuki Iwashima authored Aug 16, 2024

Since introduced, OOB skb holds an additional reference count with no
special reason and caused many issues.

Also, kfree_skb() and consume_skb() are used to decrement the count,
which is confusing.

Let's drop the unnecessary skb_get() in queue_oob() and corresponding
kfree_skb(), consume_skb(), and skb_unref().

Now unix_sk(sk)->oob_skb is just a pointer to skb in the receive queue,
so special handing is no longer needed in GC.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240816233921.57800-1-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8594d9b8

dt-bindings: net: socionext,uniphier-ave4: add top-level constraints · 2862c934

Krzysztof Kozlowski authored Aug 18, 2024

Properties with variable number of items per each device are expected to
have widest constraints in top-level "properties:" block and further
customized (narrowed) in "if:then:". Add missing top-level constraints
for clock-names and reset-names.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20240818172905.121829-4-krzysztof.kozlowski@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2862c934

dt-bindings: net: renesas,etheravb: add top-level constraints · 70d16e13

Krzysztof Kozlowski authored Aug 18, 2024

Properties with variable number of items per each device are expected to
have widest constraints in top-level "properties:" block and further
customized (narrowed) in "if:then:". Add missing top-level constraints
for reg, clocks, clock-names, interrupts and interrupt-names.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20240818172905.121829-3-krzysztof.kozlowski@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

70d16e13

dt-bindings: net: mediatek,net: add top-level constraints · 06ab21c3

Krzysztof Kozlowski authored Aug 18, 2024

Properties with variable number of items per each device are expected to
have widest constraints in top-level "properties:" block and further
customized (narrowed) in "if:then:". Add missing top-level constraints
for clocks and clock-names.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20240818172905.121829-2-krzysztof.kozlowski@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

06ab21c3

dt-bindings: net: mediatek,net: narrow interrupts per variants · 55da77de

Krzysztof Kozlowski authored Aug 18, 2024

Each variable-length property like interrupts must have fixed
constraints on number of items for given variant in binding. The
clauses in "if:then:" block should define both limits: upper and lower.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20240818172905.121829-1-krzysztof.kozlowski@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

55da77de

net: Silence false field-spanning write warning in metadata_dst memcpy · 13cfd6a6

Gal Pressman authored Aug 18, 2024

When metadata_dst struct is allocated (using metadata_dst_alloc()), it
reserves room for options at the end of the struct.

Change the memcpy() to unsafe_memcpy() as it is guaranteed that enough
room (md_size bytes) was allocated and the field-spanning write is
intentional.

This resolves the following warning:
	------------[ cut here ]------------
	memcpy: detected field-spanning write (size 104) of single field "&new_md->u.tun_info" at include/net/dst_metadata.h:166 (size 96)
	WARNING: CPU: 2 PID: 391470 at include/net/dst_metadata.h:166 tun_dst_unclone+0x114/0x138 [geneve]
	Modules linked in: act_tunnel_key geneve ip6_udp_tunnel udp_tunnel act_vlan act_mirred act_skbedit cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress sbsa_gwdt ipmi_devintf ipmi_msghandler xfrm_interface xfrm6_tunnel tunnel6 tunnel4 xfrm_user xfrm_algo nvme_fabrics overlay optee openvswitch nsh nf_conncount ib_srp scsi_transport_srp rpcrdma rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm uio_pdrv_genirq uio mlxbf_pmc pwr_mlxbf mlxbf_bootctl bluefield_edac nft_chain_nat binfmt_misc xt_MASQUERADE nf_nat xt_tcpmss xt_NFLOG nfnetlink_log xt_recent xt_hashlimit xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_comment ipt_REJECT nf_reject_ipv4 nft_compat nf_tables nfnetlink sch_fq_codel dm_multipath fuse efi_pstore ip_tables btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq raid1 raid0 nvme nvme_core mlx5_ib ib_uverbs ib_core ipv6 crc_ccitt mlx5_core crct10dif_ce mlxfw
	 psample i2c_mlxbf gpio_mlxbf2 mlxbf_gige mlxbf_tmfifo
	CPU: 2 PID: 391470 Comm: handler6 Not tainted 6.10.0-rc1 #1
	Hardware name: https://www.mellanox.com BlueField SoC/BlueField SoC, BIOS 4.5.0.12993 Dec  6 2023
	pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
	pc : tun_dst_unclone+0x114/0x138 [geneve]
	lr : tun_dst_unclone+0x114/0x138 [geneve]
	sp : ffffffc0804533f0
	x29: ffffffc0804533f0 x28: 000000000000024e x27: 0000000000000000
	x26: ffffffdcfc0e8e40 x25: ffffff8086fa6600 x24: ffffff8096a0c000
	x23: 0000000000000068 x22: 0000000000000008 x21: ffffff8092ad7000
	x20: ffffff8081e17900 x19: ffffff8092ad7900 x18: 00000000fffffffd
	x17: 0000000000000000 x16: ffffffdcfa018488 x15: 695f6e75742e753e
	x14: 2d646d5f77656e26 x13: 6d5f77656e262220 x12: 646c65696620656c
	x11: ffffffdcfbe33ae8 x10: ffffffdcfbe1baa8 x9 : ffffffdcfa0a4c10
	x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
	x5 : ffffff83fdeeb010 x4 : 0000000000000000 x3 : 0000000000000027
	x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80913f6780
	Call trace:
	 tun_dst_unclone+0x114/0x138 [geneve]
	 geneve_xmit+0x214/0x10e0 [geneve]
	 dev_hard_start_xmit+0xc0/0x220
	 __dev_queue_xmit+0xa14/0xd38
	 dev_queue_xmit+0x14/0x28 [openvswitch]
	 ovs_vport_send+0x98/0x1c8 [openvswitch]
	 do_output+0x80/0x1a0 [openvswitch]
	 do_execute_actions+0x172c/0x1958 [openvswitch]
	 ovs_execute_actions+0x64/0x1a8 [openvswitch]
	 ovs_packet_cmd_execute+0x258/0x2d8 [openvswitch]
	 genl_family_rcv_msg_doit+0xc8/0x138
	 genl_rcv_msg+0x1ec/0x280
	 netlink_rcv_skb+0x64/0x150
	 genl_rcv+0x40/0x60
	 netlink_unicast+0x2e4/0x348
	 netlink_sendmsg+0x1b0/0x400
	 __sock_sendmsg+0x64/0xc0
	 ____sys_sendmsg+0x284/0x308
	 ___sys_sendmsg+0x88/0xf0
	 __sys_sendmsg+0x70/0xd8
	 __arm64_sys_sendmsg+0x2c/0x40
	 invoke_syscall+0x50/0x128
	 el0_svc_common.constprop.0+0x48/0xf0
	 do_el0_svc+0x24/0x38
	 el0_svc+0x38/0x100
	 el0t_64_sync_handler+0xc0/0xc8
	 el0t_64_sync+0x1a4/0x1a8
	---[ end trace 0000000000000000 ]---
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20240818114351.3612692-1-gal@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

13cfd6a6

net: hns3: Use ARRAY_SIZE() to improve readability · 2cbece60

Zhang Zekun authored Aug 18, 2024

There is a helper function ARRAY_SIZE() to help calculating the
u32 array size, and we don't need to do it mannually. So, let's
use ARRAY_SIZE() to calculate the array size, and improve the code
readability.
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Jijie Shao<shaojijie@huawei.com>
Link: https://patch.msgid.link/20240818052518.45489-1-zhangzekun11@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2cbece60

selftests: net/forwarding: spawn sh inside vrf to speed up ping loop · 555e5531

Jakub Kicinski authored Aug 17, 2024

Looking at timestamped output of netdev CI reveals that
most of the time in forwarding tests for custom route
hashing is spent on a single case, namely the test which
uses ping (mausezahn does not support flow labels).

On a non-debug kernel we spend 714 of 730 total test
runtime (97%) on this test case. While having flow label
support in a traffic gen tool / mausezahn would be best,
we can significantly speed up the loop by putting ip vrf exec
outside of the iteration.

In a test of 1000 pings using a normal loop takes 50 seconds
to finish. While using:

  ip vrf exec $vrf sh -c "$loop-body"

takes 12 seconds (1/4 of the time).

Some of the slowness is likely due to our inefficient virtualization
setup, but even on my laptop running "ip link help" 16k times takes
25-30 seconds, so I think it's worth optimizing even for fastest
setups.
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20240817203659.712085-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

555e5531

net: ethernet: ibm: Simpify code with for_each_child_of_node() · 79765386

Zhang Zekun authored Aug 16, 2024

for_each_child_of_node can help to iterate through the device_node,
and we don't need to use while loop. No functional change with this
conversion.
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240816015837.109627-1-zhangzekun11@huawei.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

79765386

Merge branch 'preparations-for-fib-rule-dscp-selector' · 6b2efdc4

Paolo Abeni authored Aug 20, 2024

Ido Schimmel says:

====================
Preparations for FIB rule DSCP selector

This patchset moves the masking of the upper DSCP bits in 'flowi4_tos'
to the core instead of relying on callers of the FIB lookup API to do
it.

This will allow us to start changing users of the API to initialize the
'flowi4_tos' field with all six bits of the DSCP field. In turn, this
will allow us to extend FIB rules with a new DSCP selector.

By masking the upper DSCP bits in the core we are able to maintain the
behavior of the TOS selector in FIB rules and routes to only match on
the lower DSCP bits.

While working on this I found two users of the API that do not mask the
upper DSCP bits before performing the lookup. The first is an ancient
netlink family that is unlikely to be used. It is adjusted in patch #1
to mask both the upper DSCP bits and the ECN bits before calling the
API.

The second user is a nftables module that differs in this regard from
its equivalent iptables module. It is adjusted in patch #2 to invoke the
API with the upper DSCP bits masked, like all other callers. The
relevant selftest passed, but in the unlikely case that regressions are
reported because of this change, we can restore the existing behavior
using a new flow information flag as discussed here [1].

The last patch moves the masking of the upper DSCP bits to the core,
making the first two patches redundant, but I wanted to post them
separately to call attention to the behavior change for these two users
of the FIB lookup API.

Future patchsets (around 3) will start unmasking the upper DSCP bits
throughout the networking stack before adding support for the new FIB
rule DSCP selector.

Changes from v1 [2]:

Patch #3: Include <linux/ip.h> in <linux/in_route.h> instead of
including it in net/ip_fib.h

[1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/
[2] https://lore.kernel.org/netdev/20240725131729.1729103-1-idosch@nvidia.com/
====================

Link: https://patch.msgid.link/20240814125224.972815-1-idosch@nvidia.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

6b2efdc4

ipv4: Centralize TOS matching · 1fa3314c

Ido Schimmel authored Aug 14, 2024

The TOS field in the IPv4 flow information structure ('flowi4_tos') is
matched by the kernel against the TOS selector in IPv4 rules and routes.
The field is initialized differently by different call sites. Some treat
it as DSCP (RFC 2474) and initialize all six DSCP bits, some treat it as
RFC 1349 TOS and initialize it using RT_TOS() and some treat it as RFC
791 TOS and initialize it using IPTOS_RT_MASK.

What is common to all these call sites is that they all initialize the
lower three DSCP bits, which fits the TOS definition in the initial IPv4
specification (RFC 791).

Therefore, the kernel only allows configuring IPv4 FIB rules that match
on the lower three DSCP bits which are always guaranteed to be
initialized by all call sites:

 # ip -4 rule add tos 0x1c table 100
 # ip -4 rule add tos 0x3c table 100
 Error: Invalid tos.

While this works, it is unlikely to be very useful. RFC 791 that
initially defined the TOS and IP precedence fields was updated by RFC
2474 over twenty five years ago where these fields were replaced by a
single six bits DSCP field.

Extending FIB rules to match on DSCP can be done by adding a new DSCP
selector while maintaining the existing semantics of the TOS selector
for applications that rely on that.

A prerequisite for allowing FIB rules to match on DSCP is to adjust all
the call sites to initialize the high order DSCP bits and remove their
masking along the path to the core where the field is matched on.

However, making this change alone will result in a behavior change. For
example, a forwarded IPv4 packet with a DS field of 0xfc will no longer
match a FIB rule that was configured with 'tos 0x1c'.

This behavior change can be avoided by masking the upper three DSCP bits
in 'flowi4_tos' before comparing it against the TOS selectors in FIB
rules and routes.

Implement the above by adding a new function that checks whether a given
DSCP value matches the one specified in the IPv4 flow information
structure and invoke it from the three places that currently match on
'flowi4_tos'.

Use RT_TOS() for the masking of 'flowi4_tos' instead of IPTOS_RT_MASK
since the latter is not uAPI and we should be able to remove it at some
point.

Include <linux/ip.h> in <linux/in_route.h> since the former defines
IPTOS_TOS_MASK which is used in the definition of RT_TOS() in
<linux/in_route.h>.

No regressions in FIB tests:

 # ./fib_tests.sh
 [...]
 Tests passed: 218
 Tests failed:   0

And FIB rule tests:

 # ./fib_rule_tests.sh
 [...]
 Tests passed: 116
 Tests failed:   0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

1fa3314c

netfilter: nft_fib: Mask upper DSCP bits before FIB lookup · 548a2029

Ido Schimmel authored Aug 14, 2024

As part of its functionality, the nftables FIB expression module
performs a FIB lookup, but unlike other users of the FIB lookup API, it
does so without masking the upper DSCP bits. In particular, this differs
from the equivalent iptables match ("rpfilter") that does mask the upper
DSCP bits before the FIB lookup.

Align the module to other users of the FIB lookup API and mask the upper
DSCP bits using IPTOS_RT_MASK before the lookup.

No regressions in nft_fib.sh:

 # ./nft_fib.sh
 PASS: fib expression did not cause unwanted packet drops
 PASS: fib expression did drop packets for 1.1.1.1
 PASS: fib expression did drop packets for 1c3::c01d
 PASS: fib expression forward check with policy based routing
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

548a2029

ipv4: Mask upper DSCP bits and ECN bits in NETLINK_FIB_LOOKUP family · 8fed5475

Ido Schimmel authored Aug 14, 2024

The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB
lookup according to user provided parameters and communicate the result
back to user space.

However, unlike other users of the FIB lookup API, the upper DSCP bits
and the ECN bits of the DS field are not masked, which can result in the
wrong result being returned.

Solve this by masking the upper DSCP bits and the ECN bits using
IPTOS_RT_MASK.

The structure that communicates the request and the response is not
exported to user space, so it is unlikely that this netlink family is
actually in use [1].

[1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

8fed5475

Merge branch 'net-smc-introduce-ringbufs-usage-statistics' · ccb445ae

Paolo Abeni authored Aug 20, 2024

Wen Gu says:

====================
net/smc: introduce ringbufs usage statistics

Currently, we have histograms that show the sizes of ringbufs that ever
used by SMC connections. However, they are always incremental and since
SMC allows the reuse of ringbufs, we cannot know the actual amount of
ringbufs being allocated or actively used.

So this patch set introduces statistics for the amount of ringbufs that
actually allocated by link group and actively used by connections of a
certain net namespace, so that we can react based on these memory usage
information, e.g. active fallback to TCP.

With appropriate adaptations of smc-tools, we can obtain these ringbufs
usage information:

$ smcr -d linkgroup
LG-ID    : 00000500
LG-Role  : SERV
LG-Type  : ASYML
VLAN     : 0
PNET-ID  :
Version  : 1
Conns    : 0
Sndbuf   : 12910592 B    <-
RMB      : 12910592 B    <-

or

$ smcr -d stats
[...]
RX Stats
  Data transmitted (Bytes)      869225943 (869.2M)
  Total requests                 18494479
  Buffer usage  (Bytes)          12910592 (12.31M)  <-
  [...]

TX Stats
  Data transmitted (Bytes)    12760884405 (12.76G)
  Total requests                 36988338
  Buffer usage  (Bytes)          12910592 (12.31M)  <-
  [...]
[...]

Change log:
v3->v2
- use new helper nla_put_uint() instead of nla_put_u64_64bit().

v2->v1
https://lore.kernel.org/r/20240807075939.57882-1-guwen@linux.alibaba.com/
- remove inline keyword in .c files.
- use local variable in macros to avoid potential side effects.

v1
https://lore.kernel.org/r/20240805090551.80786-1-guwen@linux.alibaba.com/
====================

Link: https://patch.msgid.link/20240814130827.73321-1-guwen@linux.alibaba.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

ccb445ae

net/smc: introduce statistics for ringbufs usage of net namespace · e0d10354

Wen Gu authored Aug 14, 2024

The buffer size histograms in smc_stats, namely rx/tx_rmbsize, record
the sizes of ringbufs for all connections that have ever appeared in
the net namespace. They are incremental and we cannot know the actual
ringbufs usage from these. So here introduces statistics for current
ringbufs usage of existing smc connections in the net namespace into
smc_stats, it will be incremented when new connection uses a ringbuf
and decremented when the ringbuf is unused.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

e0d10354

net/smc: introduce statistics for allocated ringbufs of link group · d386d59b

Wen Gu authored Aug 14, 2024

Currently we have the statistics on sndbuf/RMB sizes of all connections
that have ever been on the link group, namely smc_stats_memsize. However
these statistics are incremental and since the ringbufs of link group
are allowed to be reused, we cannot know the actual allocated buffers
through these. So here introduces the statistic on actual allocated
ringbufs of the link group, it will be incremented when a new ringbuf is
added into buf_list and decremented when it is deleted from buf_list.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

d386d59b

net: remove redundant check in skb_shift() · dca9d62a

Zhang Changzhong authored Aug 15, 2024

The check for '!to' is redundant here, since skb_can_coalesce() already
contains this check.
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1723730983-22912-1-git-send-email-zhangchangzhong@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

dca9d62a

mptcp: Remove unused declaration mptcp_sockopt_sync() · af3dc0ad

Yue Haibing authored Aug 16, 2024

Commit a1ab24e5 ("mptcp: consolidate sockopt synchronization")
removed the implementation but leave declaration.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240816100404.879598-1-yuehaibing@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

af3dc0ad

net/mlx5: E-Switch, Remove unused declarations · c5e2a1b0

Yue Haibing authored Aug 16, 2024

These are never implenmented since commit b691b111 ("net/mlx5: Implement
devlink port function cmds to control ipsec_packet").
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240816101550.881844-1-yuehaibing@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c5e2a1b0

igbvf: Remove two unused declarations · 12906bab

Yue Haibing authored Aug 16, 2024

There is no caller and implementations in tree.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240816101638.882072-1-yuehaibing@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

12906bab

gve: Remove unused declaration gve_rx_alloc_rings() · 359c5eb0

Yue Haibing authored Aug 16, 2024

Commit f13697cc ("gve: Switch to config-aware queue allocation")
convert this function to gve_rx_alloc_rings_gqi().
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240816101906.882743-1-yuehaibing@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

359c5eb0

tcp_metrics: use netlink policy for IPv6 addr len validation · a2901083

Jakub Kicinski authored Aug 16, 2024

Use the netlink policy to validate IPv6 address length.
Destination address currently has policy for max len set,
and source has no policy validation. In both cases
the code does the real check. With correct policy
check the code can be removed.
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240816212245.467745-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a2901083

19 Aug, 2024 1 commit

dt-binding: ptp: fsl,ptp: add pci1957,ee02 compatible string for fsl,enetc-ptp · 1bf8e07c

Frank Li authored Aug 14, 2024

fsl,enetc-ptp is embedded pcie device. Add compatible string pci1957,ee02.

Fix warning:
arch/arm64/boot/dts/freescale/fsl-ls1028a-kontron-kbox-a-230-ls.dtb: ethernet@0,4:
	compatible:0: 'pci1957,ee02' is not one of ['fsl,etsec-ptp', 'fsl,fman-ptp-timer', 'fsl,dpaa2-ptp', 'fsl,enetc-ptp']
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1bf8e07c

17 Aug, 2024 2 commits

bnx2x: Set ivi->vlan field as an integer · a99ef548

Simon Horman authored Aug 15, 2024

In bnx2x_get_vf_config():
* The vlan field of ivi is a 32-bit integer, it is used to store a vlan ID.
* The vlan field of bulletin is a 16-bit integer, it is also used to store
  a vlan ID.

In the current code, ivi->vlan is set using memset. But in the case of
setting it to the value of bulletin->vlan, this involves reading
32 bits from a 16bit source. This is likely safe, as the following
6 bytes are padding in the same structure, but none the less, it seems
undesirable.

However, it is entirely unclear to me how this scheme works on
big-endian systems.

Resolve this by simply assigning integer values to ivi->vlan.

Flagged by W=1 builds.
f.e. gcc-14 reports:

In function 'fortify_memcpy_chk',
    inlined from 'bnx2x_get_vf_config' at .../bnx2x_sriov.c:2655:4:
.../fortify-string.h:580:25: warning: call to '__read_overflow2_field' declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Wattribute-warning]
  580 |                         __read_overflow2_field(q_size_field, size);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compile tested only.
Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20240815-bnx2x-int-vlan-v1-1-5940b76e37ad@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a99ef548

mpls: Reduce skb re-allocations due to skb_cow() · f4ae8420

Christoph Paasch authored Aug 15, 2024

mpls_xmit() needs to prepend the MPLS-labels to the packet. That implies
one needs to make sure there is enough space for it in the headers.

Calling skb_cow() implies however that one wants to change even the
playload part of the packet (which is not true for MPLS). Thus, call
skb_cow_head() instead, which is what other tunnelling protocols do.

Running a server with this comm it entirely removed the calls to
pskb_expand_head() from the callstack in mpls_xmit() thus having
significant CPU-reduction, especially at peak times.

Cc: Roopa Prabhu <roopa@nvidia.com>
Reported-by: Craig Taylor <cmtaylor@apple.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240815161201.22021-1-cpaasch@apple.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f4ae8420

16 Aug, 2024 4 commits

net: txgbe: Remove unnecessary NULL check before free · 1c66df86

Simon Horman authored Aug 15, 2024

Remove unnecessary NULL check before freeing using kvfree().
This function will ignore a NULL argument.

Flagged by Coccinelle:

  .../txgbe_hw.c:187:2-8: WARNING: NULL check before some freeing functions is not needed.

No functional change intended.
Compile tested only.
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240815-txgbe-kvfree-v1-1-5ecf8656f555@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1c66df86

net: ethernet: lantiq_etop: remove unused variable · 1f803c95

Aleksander Jan Bajkowski authored Aug 15, 2024

Remove a variable that has never been used.
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Link: https://patch.msgid.link/20240815074956.155224-1-olek2@wp.plSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1f803c95

docs: networking: Align documentation with behavior change · 9480fd0c

Tariq Toukan authored Aug 15, 2024

Following commit 9f7e8fbb ("net/mlx5: offset comp irq index in name by one"),
which fixed the index in IRQ name to start once again from 0, we change
the documentation accordingly.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20240815142343.2254247-1-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9480fd0c

dt-bindings: net: mdio: change nodename match pattern · 02404bdb

Frank Li authored Aug 15, 2024

Change mdio.yaml nodename match pattern to
	'^mdio(-(bus|external))?(@.+|-([0-9]+))$'

Fix mdio.yaml wrong parser mdio controller's address instead phy's address
when mdio-mux exista.

For example:
mdio-mux-emi1@54 {
	compatible = "mdio-mux-mmioreg", "mdio-mux";

        mdio@20 {
		reg = <0x20>;
		       ^^^ This is mdio controller register

		ethernet-phy@2 {
			reg = <0x2>;
                              ^^^ This phy's address
		};
	};
};

Only phy's address is limited to 31 because MDIO bus definition.

But CHECK_DTBS report below warning:

arch/arm64/boot/dts/freescale/fsl-ls1043a-qds.dtb: mdio-mux-emi1@54:
	mdio@20:reg:0:0: 32 is greater than the maximum of 31

The reason is that "mdio-mux-emi1@54" match "nodename: '^mdio(@.*)?'" in
mdio.yaml.

Change to '^mdio(-(bus|external))?(@.+|-([0-9]+))?$' to avoid wrong match
mdio mux controller's node.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20240815163408.4184705-1-Frank.Li@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

02404bdb