Commits · b1711d4310c24f6f668e2282e827434f9473c8bd · Kirill Smelkov / linux

21 Nov, 2023 9 commits

Merge branch 'nfp-add-flow-steering-support' · b1711d43

Jakub Kicinski authored Nov 20, 2023

Louis Peens says:

====================
nfp: add flow-steering support

This short series adds flow steering support for the nfp driver.
The first patch adds the part to communicate with ethtool but
stubs out the HW offload parts. The second patch implements the
HW communication and offloads flow steering.

After this series the user can now use 'ethtool -N/-n' to configure
and display rx classification rules.
====================

Link: https://lore.kernel.org/r/20231117071114.10667-1-louis.peens@corigine.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b1711d43

nfp: offload flow steering to the nfp · c38fb3dc

Yinjun Zhang authored Nov 17, 2023

This is the second part to implement flow steering. Mailbox is used
for the communication between driver and HW.
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231117071114.10667-3-louis.peens@corigine.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c38fb3dc

nfp: add ethtool flow steering callbacks · 9eb03bb1

Yinjun Zhang authored Nov 17, 2023

This is the first part to implement flow steering. The communication
between ethtool and driver is done. User can use following commands
to display and set flows:

ethtool -n <netdev>
ethtool -N <netdev> flow-type ...
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20231117071114.10667-2-louis.peens@corigine.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9eb03bb1

Merge branch 'net-axienet-introduce-dmaengine' · 21612f52

Jakub Kicinski authored Nov 20, 2023

Radhey Shyam Pandey says:

====================
net: axienet: Introduce dmaengine

The axiethernet driver can use the dmaengine framework to communicate
with the xilinx DMAengine driver(AXIDMA, MCDMA). The inspiration behind
this dmaengine adoption is to reuse the in-kernel xilinx dma engine
driver[1] and remove redundant dma programming sequence[2] from the
ethernet driver. This simplifies the ethernet driver and also makes
it generic to be hooked to any complaint dma IP i.e AXIDMA, MCDMA
without any modification.

The dmaengine framework was extended for metadata API support during
the axidma RFC[3] discussion. However, it still needs further
enhancements to make it well suited for ethernet usecases.

Comments, suggestions, thoughts to implement remaining functional
features are very welcome!

[1]: https://github.com/torvalds/linux/blob/master/drivers/dma/xilinx/xilinx_dma.c
[2]: https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/xilinx/xilinx_axienet_main.c#L238
[3]: http://lkml.iu.edu/hypermail/linux/kernel/1804.0/00367.html
[4]: https://lore.kernel.org/all/20221124102745.2620370-1-sarath.babu.naidu.gaddam@amd.com
====================

Link: https://lore.kernel.org/r/1700074613-1977070-1-git-send-email-radhey.shyam.pandey@amd.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

21612f52

net: axienet: Introduce dmaengine support · 6a91b846

Radhey Shyam Pandey authored Nov 16, 2023

Add dmaengine framework to communicate with the xilinx DMAengine
driver(AXIDMA).

Axi ethernet driver uses separate channels for transmit and receive.
Add support for these channels to handle TX and RX with skb and
appropriate callbacks. Also add axi ethernet core interrupt for
dmaengine framework support.

The dmaengine framework was extended for metadata API support.
However it still needs further enhancements to make it well suited for
ethernet usecases. The ethernet features i.e ethtool set/get of DMA IP
properties, ndo_poll_controller,(mentioned in TODO) are not supported
and it requires follow-up discussions.

dmaengine support has a dependency on xilinx_dma as it uses
xilinx_vdma_channel_set_config() API to reset the DMA IP
which internally reset MAC prior to accessing MDIO.

Benchmark with netperf:

xilinx-zcu102-20232:~$ netperf -H 192.168.10.20 -t TCP_STREAM
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.10.20 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

131072 16384 16384 10.02 886.69

xilinx-zcu102-20232:~$ netperf -H 192.168.10.20 -t UDP_STREAM
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 192.168.10.20 () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

212992 65507 10.00 15851 0 830.66
212992 10.00 15851 830.66
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Link: https://lore.kernel.org/r/1700074613-1977070-4-git-send-email-radhey.shyam.pandey@amd.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6a91b846

net: axienet: Preparatory changes for dmaengine support · 6b1b40f7

Sarath Babu Naidu Gaddam authored Nov 16, 2023

The axiethernet driver has inbuilt dma programming. In order to add
dmaengine support and make it's integration seamless the current axidma
inbuilt programming code is put under use_dmaengine check.

It also performs minor code reordering to minimize conditional
use_dmaengine checks and there is no functional change. It uses
"dmas" property to identify whether it should use a dmaengine
framework or inbuilt axidma programming.
Signed-off-by: Sarath Babu Naidu Gaddam <sarath.babu.naidu.gaddam@amd.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Link: https://lore.kernel.org/r/1700074613-1977070-3-git-send-email-radhey.shyam.pandey@amd.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6b1b40f7

dt-bindings: net: xlnx,axi-ethernet: Introduce DMA support · 5e63c5ef

Radhey Shyam Pandey authored Nov 16, 2023

Xilinx 1G/2.5G Ethernet Subsystem provides 32-bit AXI4-Stream buses to
move transmit and receive Ethernet data to and from the subsystem.

These buses are designed to be used with an AXI Direct Memory Access(DMA)
IP or AXI Multichannel Direct Memory Access (MCDMA) IP core, AXI4-Stream
Data FIFO, or any other custom logic in any supported device.

Primary high-speed DMA data movement between system memory and stream
target is through the AXI4 Read Master to AXI4 memory-mapped to stream
(MM2S) Master, and AXI stream to memory-mapped (S2MM) Slave to AXI4
Write Master. AXI DMA/MCDMA enables channel of data movement on both
MM2S and S2MM paths in scatter/gather mode.

AXI DMA has two channels where as MCDMA has 16 Tx and 16 Rx channels.
To uniquely identify each channel use 'chan' suffix. Depending on the
usecase AXI ethernet driver can request any combination of multichannel
DMA channels using generic dmas, dma-names properties.

Example:
dma-names = tx_chan0, rx_chan0, tx_chan1, rx_chan1;
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/1700074613-1977070-2-git-send-email-radhey.shyam.pandey@amd.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

5e63c5ef

selftests: net: verify fq per-band packet limit · a0bc96c0

Willem de Bruijn authored Nov 16, 2023

Commit 29f834aa ("net_sched: sch_fq: add 3 bands and WRR
scheduling") introduces multiple traffic bands, and per-band maximum
packet count.

Per-band limits ensures that packets in one class cannot fill the
entire qdisc and so cause DoS to the traffic in the other classes.

Verify this behavior:
1. set the limit to 10 per band
2. send 20 pkts on band A: verify that 10 are queued, 10 dropped
3. send 20 pkts on band A: verify that 0 are queued, 20 dropped
4. send 20 pkts on band B: verify that 10 are queued, 10 dropped

Packets must remain queued for a period to trigger this behavior.
Use SO_TXTIME to store packets for 100 msec.

The test reuses existing upstream test infra. The script is a fork of
cmsg_time.sh. The scripts call cmsg_sender.

The test extends cmsg_sender with two arguments:

* '-P' SO_PRIORITY
There is a subtle difference between IPv4 and IPv6 stack behavior:
PF_INET/IP_TOS sets IP header bits and sk_priority
PF_INET6/IPV6_TCLASS sets IP header bits BUT NOT sk_priority

* '-n' num pkts
Send multiple packets in quick succession.
I first attempted a for loop in the script, but this is too slow in
virtualized environments, causing flakiness as the 100ms timeout is
reached and packets are dequeued.

Also do not wait for timestamps to be queued unless timestamps are
requested.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231116203449.2627525-1-willemdebruijn.kernel@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a0bc96c0

net: microchip: lan743x : bidirectional throughput improvement · 45933b2d

Vishvambar Panth S authored Nov 16, 2023

The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but the
device supports placing the descriptors in memory back to back or
reserving space in between them using its DMA_DESCRIPTOR_SPACE (DSPACE)
configurable hardware setting. Currently DSPACE is unnecessarily set to
match the host's L1 cache line size, resulting in space reserved in
between descriptors in most platforms and causing a suboptimal behavior
(single PCIe Mem transaction per descriptor). By changing the setting
to DSPACE=16 many descriptors can be packed in a single PCIe Mem
transaction resulting in a massive performance improvement in
bidirectional tests without any negative effects.
Tested and verified improvements on x64 PC and several ARM platforms
(typical data below)

Test setup 1: x64 PC with LAN7430 ---> x64 PC

iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-10.00  sec   170 MBytes   143 Mbits/sec  sender
[  5][TX-C]   0.00-10.04  sec   169 MBytes   141 Mbits/sec  receiver
[  7][RX-C]   0.00-10.00  sec  1.02 GBytes   876 Mbits/sec  sender
[  7][RX-C]   0.00-10.04  sec  1.02 GBytes   870 Mbits/sec  receiver

iperf3 UDP bidirectional with DSPACE set to 16 Bytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-10.00  sec  1.11 GBytes   956 Mbits/sec  sender
[  5][TX-C]   0.00-10.04  sec  1.11 GBytes   951 Mbits/sec  receiver
[  7][RX-C]   0.00-10.00  sec  1.10 GBytes   948 Mbits/sec  sender
[  7][RX-C]   0.00-10.04  sec  1.10 GBytes   942 Mbits/sec  receiver

Test setup 2 : RK3399 with LAN7430 ---> x64 PC

RK3399 Spec:
The SOM-RK3399 is ARM module designed and developed by FriendlyElec.
Cores: 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53
Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz)
PCIe: PCIe x4, compatible with PCIe 2.1, Dual operation mode

iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-10.00  sec   534 MBytes   448 Mbits/sec  sender
[  5][TX-C]   0.00-10.05  sec   534 MBytes   446 Mbits/sec  receiver
[  7][RX-C]   0.00-10.00  sec  1.12 GBytes   961 Mbits/sec  sender
[  7][RX-C]   0.00-10.05  sec  1.11 GBytes   946 Mbits/sec  receiver

iperf3 UDP bidirectional with DSPACE set to 16 Bytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval           Transfer     Bitrate
[  5][TX-C]   0.00-10.00  sec   966 MBytes   810 Mbits/sec   sender
[  5][TX-C]   0.00-10.04  sec   965 MBytes   806 Mbits/sec   receiver
[  7][RX-C]   0.00-10.00  sec  1.11 GBytes   956 Mbits/sec   sender
[  7][RX-C]   0.00-10.04  sec  1.07 GBytes   919 Mbits/sec   receiver
Signed-off-by: Vishvambar Panth S <vishvambarpanth.s@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20231116054350.620420-1-vishvambarpanth.s@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

45933b2d

19 Nov, 2023 13 commits

net: ethernet: mtk_wed: rely on __dev_alloc_page in mtk_wed_tx_buffer_alloc · 94c81c62

Lorenzo Bianconi authored Nov 17, 2023

Simplify the code and use __dev_alloc_page() instead of __dev_alloc_pages()
with order 0 in mtk_wed_tx_buffer_alloc routine
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

94c81c62

Merge branch 'am65-cpsw-ethtool-mac-stats' · 69d5ee8c

David S. Miller authored Nov 19, 2023

Roger Quadros says:

===================
net: eth: am65-cpsw: add ethtool MAC stats

Gets 'ethtool -S eth0 --groups eth-mac' command to work.

Also set default TX channels to maximum available and does
cleanup in am65_cpsw_nuss_common_open() error path.

Changelog:
v2:
- add __iomem to *stats, to prevent sparse warning
- clean up RX descriptors and free up SKB in error handling of
  am65_cpsw_nuss_common_open()
- Re-arrange some funcitons to avoid forward declaration
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

69d5ee8c

net: ethernet: ti: am65-cpsw: Fix error handling in am65_cpsw_nuss_common_open() · ebd7bf60

Roger Quadros authored Nov 17, 2023

k3_udma_glue_enable_rx/tx_chn returns error code on failure.
Bail out on error while enabling TX/RX channel.

In the error path, clean up the RX descriptors and SKBs.
Get rid of kmemleak_not_leak() as it seems unnecessary now.

Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ebd7bf60

net: ethernet: am65-cpsw: Set default TX channels to maximum · be397ea3

Roger Quadros authored Nov 17, 2023

am65-cpsw supports 8 TX hardware queues. Set this as default.

The rationale is that some am65-cpsw devices can have up to 4 ethernet
ports. If the number of TX channels have to be changed then all
interfaces have to be brought down and up as the old default of 1
TX channel is too restrictive for any mqprio/taprio usage.

Another reason for this change is to allow testing using
kselftest:net/forwarding:ethtool_mm.sh out of the box.
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

be397ea3

net: ethernet: ti: am65-cpsw: Re-arrange functions to avoid forward declaration · ac099466

Roger Quadros authored Nov 17, 2023

Re-arrange am65_cpsw_nuss_rx_cleanup(), am65_cpsw_nuss_xmit_free() and
am65_cpsw_nuss_tx_cleanup() to avoid forward declaration.

No functional change.
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac099466

net: ethernet: am65-cpsw: Add standard Ethernet MAC stats to ethtool · 67372d7a

Roger Quadros authored Nov 17, 2023

Gets 'ethtool -S eth0 --groups eth-mac' command to work.
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

67372d7a

rtnetlink: introduce nlmsg_new_large and use it in rtnl_getlink · ac40916a

Li RongQing authored Nov 15, 2023

if a PF has 256 or more VFs, ip link command will allocate an order 3
memory or more, and maybe trigger OOM due to memory fragment,
the VFs needed memory size is computed in rtnl_vfinfo_size.

so introduce nlmsg_new_large which calls netlink_alloc_large_skb in
which vmalloc is used for large memory, to avoid the failure of
allocating memory

    ip invoked oom-killer: gfp_mask=0xc2cc0(GFP_KERNEL|__GFP_NOWARN|\
	__GFP_COMP|__GFP_NOMEMALLOC), order=3, oom_score_adj=0
    CPU: 74 PID: 204414 Comm: ip Kdump: loaded Tainted: P           OE
    Call Trace:
    dump_stack+0x57/0x6a
    dump_header+0x4a/0x210
    oom_kill_process+0xe4/0x140
    out_of_memory+0x3e8/0x790
    __alloc_pages_slowpath.constprop.116+0x953/0xc50
    __alloc_pages_nodemask+0x2af/0x310
    kmalloc_large_node+0x38/0xf0
    __kmalloc_node_track_caller+0x417/0x4d0
    __kmalloc_reserve.isra.61+0x2e/0x80
    __alloc_skb+0x82/0x1c0
    rtnl_getlink+0x24f/0x370
    rtnetlink_rcv_msg+0x12c/0x350
    netlink_rcv_skb+0x50/0x100
    netlink_unicast+0x1b2/0x280
    netlink_sendmsg+0x355/0x4a0
    sock_sendmsg+0x5b/0x60
    ____sys_sendmsg+0x1ea/0x250
    ___sys_sendmsg+0x88/0xd0
    __sys_sendmsg+0x5e/0xa0
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f95a65a5b70

Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20231115120108.3711-1-lirongqing@baidu.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ac40916a

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 459a70ba

Jakub Kicinski authored Nov 18, 2023

Tony Nguyen says:

====================
ice: one by one port representors creation

Michal Swiatkowski says:

Currently ice supports creating port representors only for VFs. For that
use case they can be created and removed in one step.

This patchset is refactoring current flow to support port representor
creation also for subfunctions and SIOV. In this case port representors
need to be created and removed one by one. Also, they can be added and
removed while other port representors are running.

To achieve that we need to change the switchdev configuration flow.
Three first patches are only cosmetic (renaming, removing not used code).
Next few ones are preparation for new flow. The most important one
is "add VF representor one by one". It fully implements new flow.

New type of port representor (for subfunction) will be introduced in
follow up patchset.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: reserve number of CP queues
  ice: adjust switchdev rebuild path
  ice: add VF representors one by one
  ice: realloc VSI stats arrays
  ice: set Tx topology every time new repr is added
  ice: allow changing SWITCHDEV_CTRL VSI queues
  ice: return pointer to representor
  ice: make representor code generic
  ice: remove VF pointer reference in eswitch code
  ice: track port representors in xarray
  ice: use repr instead of vf->repr
  ice: track q_id in representor
  ice: remove unused control VSI parameter
  ice: remove redundant max_vsi_num variable
  ice: rename switchdev to eswitch
====================

Link: https://lore.kernel.org/r/20231114181449.1290117-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

459a70ba

Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · a49296e0

Jakub Kicinski authored Nov 18, 2023

Tony Nguyen says:

====================
igc: Add support for physical + free-running timers

Vinicius Costa Gomes says:

The objective is to allow having functionality that depends on the
physical timer (taprio and ETF offloads, for example) and vclocks
operating together.

The "big" missing piece is the implementation of the .getcyclesx64()
function in igc, as i225/i226 have multiple timers, we use one of
those timers (timer 1) as a free-running (non adjustable) timer.

The complication is that only implementing .getcyclesx64() and nothing
else will break synchronization when using vclocks, as reading the clock
will retrieve the free-running value but timnestamps will come from the
adjustable timer. The solution is to modify "in one go" the timestamping
code to be able to retrieve the timestamp from the correct timer (if a
socket is "phc_bound" to a vclock the timestamp will come from the
free-running timer).

I was debating whether or not to do the adjustments for the internal latencies
for the free-running timestamps, decided to do the adjustments so the path
delay when using vclocks is similar to the one when using the physical clock.

One future improvement is to implement the .getcrosscycles() function.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  igc: Add support for PTP .getcyclesx64()
  igc: Simplify setting flags in the TX data descriptor
====================

Link: https://lore.kernel.org/r/20231114183640.1303163-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a49296e0

Merge branch 'net-sched-cls_u32-use-proper-refcounts' · 516cba96

Jakub Kicinski authored Nov 18, 2023

Pedro Tammela says:

====================
net/sched: cls_u32: use proper refcounts

In u32 we are open coding refcounts of hashtables with integers which is
far from ideal. Update those with proper refcount and add a couple of
tests to tdc that exercise the refcounts explicitly.
====================

Link: https://lore.kernel.org/r/20231114141856.974326-1-pctammela@mojatatu.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

516cba96

selftests/tc-testing: add hashtable tests for u32 · 54293e4d

Pedro Tammela authored Nov 14, 2023

Add tests to specifically check for the refcount interactions of
hashtables created by u32. These tables should not be deleted when
referenced and the flush order should respect a tree like composition.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231114141856.974326-3-pctammela@mojatatu.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

54293e4d

net/sched: cls_u32: replace int refcounts with proper refcounts · 6b78debe

Pedro Tammela authored Nov 14, 2023

Proper refcounts will always warn splat when something goes wrong,
be it underflow, saturation or object resurrection. As these are always
a source of bugs, use it in cls_u32 as a safeguard to prevent/catch issues.
Another benefit is that the refcount API self documents the code, making
clear when transitions to dead are expected.

For such an update we had to make minor adaptations on u32 to fit the refcount
API. First we set explicitly to '1' when objects are created, then the
objects are alive until a 1 -> 0 happens, which is then released appropriately.

The above made clear some redundant operations in the u32 code
around the root_ht handling that were removed. The root_ht is created
with a refcnt set to 1. Then when it's associated with tcf_proto it increments the refcnt to 2.
Throughout the entire code the root_ht is an exceptional case and can never be referenced,
therefore the refcnt never incremented/decremented.
Its lifetime is always bound to tcf_proto, meaning if you delete tcf_proto
the root_ht is deleted as well. The code made up for the fact that root_ht refcnt is 2 and did
a double decrement to free it, which is not a fit for the refcount API.

Even though refcount_t is implemented using atomics, we should observe
a negligible control plane impact.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231114141856.974326-2-pctammela@mojatatu.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6b78debe

net: partial revert of the "Make timestamping selectable: series · 289354f2

Jakub Kicinski authored Nov 18, 2023

Revert following commits:

commit acec05fb ("net_tstamp: Add TIMESTAMPING SOFTWARE and HARDWARE mask")
commit 11d55be0 ("net: ethtool: Add a command to expose current time stamping layer")
commit bb8645b0 ("netlink: specs: Introduce new netlink command to get current timestamp")
commit d905f9c7 ("net: ethtool: Add a command to list available time stamping layers")
commit aed5004e ("netlink: specs: Introduce new netlink command to list available time stamping layers")
commit 51bdf316 ("net: Replace hwtstamp_source by timestamping layer")
commit 0f7f463d ("net: Change the API of PHY default timestamp to MAC")
commit 091fab12 ("net: ethtool: ts: Update GET_TS to reply the current selected timestamp")
commit 152c75e1 ("net: ethtool: ts: Let the active time stamping layer be selectable")
commit ee60ea6b ("netlink: specs: Introduce time stamping set command")

They need more time for reviews.

Link: https://lore.kernel.org/all/20231118183529.6e67100c@kernel.org/Signed-off-by: Jakub Kicinski <kuba@kernel.org>

289354f2

18 Nov, 2023 18 commits

r8169: improve RTL8411b phy-down fixup · 055dd751

Heiner Kallweit authored Nov 13, 2023

Mirsad proposed a patch to reduce the number of spinlock lock/unlock
operations and the function code size. This can be further improved
because the function sets a consecutive register block.
Suggested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: David S. Miller <davem@davemloft.net>

055dd751

Merge tag 'mlx5-updates-2023-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · ce30df20

David S. Miller authored Nov 18, 2023

mlx5-updates-2023-11-13

1) Cleanup patches, leftovers from previous cycle

2) Allow sync reset flow when BF MGT interface device is present

3) Trivial ptp refactorings and improvements

4) Add local loopback counter to vport rep stats
Signed-off-by: David S. Miller <davem@davemloft.net>

ce30df20

Merge tag 'batadv-next-pullrequest-20231115' of git://git.open-mesh.org/linux-merge · 39620a35

David S. Miller authored Nov 18, 2023

This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Implement new multicast packet type, including its transmission,
   forwarding and parsing, by Linus Lüssing (3 patches)

 - Switch to new headers for sprintf and array size,
   by Sven Eckelmann (2 patches)
Signed-off-by: David S. Miller <davem@davemloft.net>

39620a35

Merge branch 'mlxsw-new-reset-flow' · 72a813a4

David S. Miller authored Nov 18, 2023

Petr Machata says:

====================
mlxsw: Add support for new reset flow

Ido Schimmel writes:

This patchset changes mlxsw to issue a PCI reset during probe and
devlink reload so that the PCI firmware could be upgraded without a
reboot.

Unlike the old version of this patchset [1], in this version the driver
no longer tries to issue a PCI reset by triggering a PCI link toggle on
its own, but instead calls the PCI core to issue the reset.

The PCI APIs require the device lock to be held which is why patches

Patches #7 adds reset method quirk for NVIDIA Spectrum devices.

Patch #8 adds a debug level print in PCI core so that device ready delay
will be printed even if it is shorter than one second.

Patches #9-#11 are straightforward preparations in mlxsw.

Patch #12 finally implements the new reset flow in mlxsw.

Patch #13 adds PCI reset handlers in mlxsw to avoid user space from
resetting the device from underneath an unaware driver. Instead, the
driver is gracefully de-initialized before the PCI reset and then
initialized again after it.

Patch #14 adds a PCI reset selftest to make sure this code path does not
regress.

[1] https://lore.kernel.org/netdev/cover.1679502371.git.petrm@nvidia.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

72a813a4

selftests: mlxsw: Add PCI reset test · af51d6bd

Ido Schimmel authored Nov 15, 2023

Test that PCI reset works correctly by verifying that only the expected
reset methods are supported and that after issuing the reset the ifindex
of the port changes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

af51d6bd

mlxsw: pci: Implement PCI reset handlers · 5e12d089

Ido Schimmel authored Nov 15, 2023

Implement reset_prepare() and reset_done() handlers that are invoked by
the PCI core before and after issuing a PCI reset, respectively.

Specifically, implement reset_prepare() by calling
mlxsw_core_bus_device_unregister() and reset_done() by calling
mlxsw_core_bus_device_register(). This is the same implementation as the
reload_{down,up}() devlink operations with the following differences:

1. The devlink instance is unregistered and then registered again after
   the reset.

2. A reset via the device's command interface (using MRSR register) is
   not issued during reset_done() as PCI core already issued a PCI
   reset.

Tested:

 # for i in $(seq 1 10); do echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset; done
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e12d089

mlxsw: pci: Add support for new reset flow · f257c73e

Ido Schimmel authored Nov 15, 2023

The driver resets the device during probe and during a devlink reload.
The current reset method reloads the current firmware version or a
pending one, if one was previously flashed using devlink. However, the
current reset method does not result in a PCI hot reset, preventing the
PCI firmware from being upgraded, unless the system is rebooted.

To solve this problem, a new reset command (6) was implemented in the
firmware. Unlike the current command (1), after issuing the new command
the device will not start the reset immediately, but only after a PCI
hot reset.

Implement the new reset method by first verifying that it is supported
by the current firmware version by querying the Management Capabilities
Mask (MCAM) register. If supported, issue the new reset command (6) via
MRSR register followed by a PCI reset by calling
__pci_reset_function_locked().

Once the PCI firmware is operational, go back to the regular reset flow
and wait for the entire device to become ready. That is, repeatedly read
the "system_status" register from the BAR until a value of "FW_READY"
(0x5E) appears.

Tested:

 # for i in $(seq 1 10); do devlink dev reload pci/0000:01:00.0; done
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f257c73e

mlxsw: pci: Move software reset code to a separate function · 8d9da467

Amit Cohen authored Nov 15, 2023

In general, the existing flow of software reset in the driver is:
1. Wait for system ready status.
2. Send MRSR command, to start the reset.
3. Wait for system ready status.

This flow will be extended once a new reset command is supported. As a
preparation, move step #2 to a separate function.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d9da467

mlxsw: pci: Rename mlxsw_pci_sw_reset() · bdf85f3a

Amit Cohen authored Nov 15, 2023

In the next patches, mlxsw_pci_sw_reset() will be extended to support
more reset types and will not necessarily issue a software reset. Rename
the function to reflect that.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bdf85f3a

mlxsw: Extend MRSR pack() function to support new commands · e6dbab40

Amit Cohen authored Nov 15, 2023

Currently mlxsw_reg_mrsr_pack() always sets 'command=1'. As preparation for
support of new reset flow, pass the command as an argument to the
function and add an enum for this field.

For now, always pass 'command=1' to the pack() function.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e6dbab40

PCI: Add debug print for device ready delay · 0a5ef959

Ido Schimmel authored Nov 15, 2023

Currently, the time it took a PCI device to become ready after reset is
only printed if it was longer than 1000ms ('PCI_RESET_WAIT'). However,
for debugging purposes it is useful to know this time even if it was
shorter. For example, with the device I am working on, hardware
engineers asked to verify that it becomes ready on the first try (no
delay).

To that end, add a debug level print that can be enabled using dynamic
debug. Example:

 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
 # dmesg -c | grep ready
 # echo "file drivers/pci/pci.c +p" > /sys/kernel/debug/dynamic_debug/control
 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
 # dmesg -c | grep ready
 [  396.060335] mlxsw_spectrum4 0000:01:00.0: ready 0ms after bus reset
 # echo "file drivers/pci/pci.c -p" > /sys/kernel/debug/dynamic_debug/control
 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
 # dmesg -c | grep ready
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

0a5ef959

PCI: Add no PM reset quirk for NVIDIA Spectrum devices · 3ed48c80

Ido Schimmel authored Nov 15, 2023

Spectrum-{1,2,3,4} devices report that a D3hot->D0 transition causes a
reset (i.e., they advertise NoSoftRst-). However, this transition does
not have any effect on the device: It continues to be operational and
network ports remain up. Advertising this support makes it seem as if a
PM reset is viable for these devices. Mark it as unavailable to skip it
when testing reset methods.

Before:

 # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
 pm bus

After:

 # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
 bus
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ed48c80

devlink: Add device lock assert in reload operation · 527a07e1

Ido Schimmel authored Nov 15, 2023

Add an assert to verify that the device lock is always held throughout
reload operations.

Tested the following flows with netdevsim and mlxsw while lockdep is
enabled:

netdevsim:

 # echo "10 1" > /sys/bus/netdevsim/new_device
 # devlink dev reload netdevsim/netdevsim10
 # ip netns add bla
 # devlink dev reload netdevsim/netdevsim10 netns bla
 # ip netns del bla
 # echo 10 > /sys/bus/netdevsim/del_device

mlxsw:

 # devlink dev reload pci/0000:01:00.0
 # ip netns add bla
 # devlink dev reload pci/0000:01:00.0 netns bla
 # ip netns del bla
 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
 # echo 1 > /sys/bus/pci/rescan
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

527a07e1

devlink: Acquire device lock during reload command · bf6b200b

Ido Schimmel authored Nov 15, 2023

Device drivers register with devlink from their probe routines (under
the device lock) by acquiring the devlink instance lock and calling
devl_register().

Drivers that support a devlink reload usually implement the
reload_{down, up}() operations in a similar fashion to their remove and
probe routines, respectively.

However, while the remove and probe routines are invoked with the device
lock held, the reload operations are only invoked with the devlink
instance lock held. It is therefore impossible for drivers to acquire
the device lock from their reload operations, as this would result in
lock inversion.

The motivating use case for invoking the reload operations with the
device lock held is in mlxsw which needs to trigger a PCI reset as part
of the reload. The driver cannot call pci_reset_function() as this
function acquires the device lock. Instead, it needs to call
__pci_reset_function_locked which expects the device lock to be held.

To that end, adjust devlink to always acquire the device lock before the
devlink instance lock when performing a reload.

Do that when reload is explicitly triggered by user space by specifying
the 'DEVLINK_NL_FLAG_NEED_DEV_LOCK' flag in the pre_doit and post_doit
operations of the reload command.

A previous patch already handled the case where reload is invoked as
part of netns dismantle.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bf6b200b

devlink: Allow taking device lock in pre_doit operations · d32c3825

Ido Schimmel authored Nov 15, 2023

Introduce a new private flag ('DEVLINK_NL_FLAG_NEED_DEV_LOCK') to allow
netlink commands to specify that they need to acquire the device lock in
their pre_doit operation and release it in their post_doit operation.

The reload command will use this flag in the subsequent patch.

No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

d32c3825

devlink: Enable the use of private flags in post_doit operations · c8d0a7d6

Ido Schimmel authored Nov 15, 2023

Currently, private flags (e.g., 'DEVLINK_NL_FLAG_NEED_PORT') are only
used in pre_doit operations, but a subsequent patch will need to
conditionally lock and unlock the device lock in pre and post doit
operations, respectively.

As a preparation, enable the use of private flags in post_doit
operations in a similar fashion to how it is done for pre_doit
operations.

No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8d0a7d6

devlink: Acquire device lock during netns dismantle · e21c52d7

Ido Schimmel authored Nov 15, 2023

Device drivers register with devlink from their probe routines (under
the device lock) by acquiring the devlink instance lock and calling
devl_register().

Drivers that support a devlink reload usually implement the
reload_{down, up}() operations in a similar fashion to their remove and
probe routines, respectively.

However, while the remove and probe routines are invoked with the device
lock held, the reload operations are only invoked with the devlink
instance lock held. It is therefore impossible for drivers to acquire
the device lock from their reload operations, as this would result in
lock inversion.

The motivating use case for invoking the reload operations with the
device lock held is in mlxsw which needs to trigger a PCI reset as part
of the reload. The driver cannot call pci_reset_function() as this
function acquires the device lock. Instead, it needs to call
__pci_reset_function_locked which expects the device lock to be held.

To that end, adjust devlink to always acquire the device lock before the
devlink instance lock when performing a reload.

For now, only do that when reload is triggered as part of netns
dismantle. Subsequent patches will handle the case where reload is
explicitly triggered by user space.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e21c52d7

devlink: Move private netlink flags to C file · 526dd6d7

Ido Schimmel authored Nov 15, 2023

The flags are not used outside of the C file so move them there.
Suggested-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

526dd6d7