Commits · bec0b7a2db358718a371e6fe0103220685552fae · Kirill Smelkov / linux

29 Mar, 2023 9 commits

tools: ynl: Add struct parsing to nlspec · bec0b7a2

Donald Hunter authored Mar 27, 2023

Add python classes for struct definitions to nlspec
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bec0b7a2

Merge tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · de749452

Jakub Kicinski authored Mar 28, 2023

Saeed Mahameed says:

====================
mlx5-updates-2023-03-20

mlx5 dynamic msix

This patch series adds support for dynamic msix vectors allocation in mlx5.

Eli Cohen Says:

================

The following series of patches modifies mlx5_core to work with the
dynamic MSIX API. Currently, mlx5_core allocates all the interrupt
vectors it needs and distributes them amongst the consumers. With the
introduction of dynamic MSIX support, which allows for allocation of
interrupts more than once, we now allocate vectors as we need them.
This allows other drivers running on top of mlx5_core to allocate
interrupt vectors for their own use. An example for this is mlx5_vdpa,
which uses these vectors to propagate interrupts directly from the
hardware to the vCPU [1].

As a preparation for using this series, a use after free issue is fixed
in lib/cpu_rmap.c and the allocator for rmap entries has been modified.
A complementary API for irq_cpu_rmap_add() has also been introduced.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/patch/?id=0f2bf1fcae96a83b8c5581854713c9fc3407556e

================

* tag 'mlx5-updates-2023-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Provide external API for allocating vectors
  net/mlx5: Use one completion vector if eth is disabled
  net/mlx5: Refactor calculation of required completion vectors
  net/mlx5: Move devlink registration before mlx5_load
  net/mlx5: Use dynamic msix vectors allocation
  net/mlx5: Refactor completion irq request/release code
  net/mlx5: Improve naming of pci function vectors
  net/mlx5: Use newer affinity descriptor
  net/mlx5: Modify struct mlx5_irq to use struct msi_map
  net/mlx5: Fix wrong comment
  net/mlx5e: Coding style fix, add empty line
  lib: cpu_rmap: Add irq_cpu_rmap_remove to complement irq_cpu_rmap_add
  lib: cpu_rmap: Use allocator for rmap entries
  lib: cpu_rmap: Avoid use after free on rmap->obj array entries
====================

Link: https://lore.kernel.org/r/20230324231341.29808-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

de749452

docs: netdev: clarify the need to sending reverts as patches · e70f94c6

Jakub Kicinski authored Mar 27, 2023

We don't state explicitly that reverts need to be submitted
as a patch. It occasionally comes up.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230327172646.2622943-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e70f94c6

net: ethernet: 8390: axnet_cs: remove unused xfer_count variable · e48cefb9

Tom Rix authored Mar 27, 2023

clang with W=1 reports
drivers/net/ethernet/8390/axnet_cs.c:653:9: error: variable
  'xfer_count' set but not used [-Werror,-Wunused-but-set-variable]
    int xfer_count = count;
        ^
This variable is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230327235423.1777590-1-trix@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e48cefb9

Revert "sh_eth: remove open coded netif_running()" · cdeccd13

Wolfram Sang authored Mar 27, 2023

This reverts commit ce1fdb06. It turned
out this actually introduces a race condition. netif_running() is not a
suitable check for get_stats.
Reported-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230327152112.15635-1-wsa+renesas@sang-engineering.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

cdeccd13

Merge branch 'net-refcount-address-dst_entry-reference-count-scalability-issues' · 2600badf

Jakub Kicinski authored Mar 28, 2023

Thomas Gleixner says:

====================
net, refcount: Address dst_entry reference count scalability issues

This is version 3 of this series. Version 2 can be found here:

     https://lore.kernel.org/lkml/20230307125358.772287565@linutronix.de

Wangyang and Arjan reported a bottleneck in the networking code related to
struct dst_entry::__refcnt. Performance tanks massively when concurrency on
a dst_entry increases.

This happens when there are a large amount of connections to or from the
same IP address. The memtier benchmark when run on the same host as
memcached amplifies this massively. But even over real network connections
this issue can be observed at an obviously smaller scale (due to the
network bandwith limitations in my setup, i.e. 1Gb). How to reproduce:

  Run memcached with -t $N and memtier_benchmark with -t $M and --ratio=1:100
  on the same machine. localhost connections amplify the problem.

  Start with the defaults for $N and $M and increase them. Depending on
  your machine this will tank at some point. But even in reasonably small
  $N, $M scenarios the refcount operations and the resulting false sharing
  fallout becomes visible in perf top. At some point it becomes the
  dominating issue.

There are two factors which make this reference count a scalability issue:

   1) False sharing

      dst_entry:__refcnt is located at offset 64 of dst_entry, which puts
      it into a seperate cacheline vs. the read mostly members located at
      the beginning of the struct.

      That prevents false sharing vs. the struct members in the first 64
      bytes of the structure, but there is also

      	    dst_entry::lwtstate

      which is located after the reference count and in the same cache
      line. This member is read after a reference count has been acquired.

      The other problem is struct rtable, which embeds a struct dst_entry
      at offset 0. struct dst_entry has a size of 112 bytes, which means
      that the struct members of rtable which follow the dst member share
      the same cache line as dst_entry::__refcnt. Especially

      	  rtable::rt_genid

      is also read by the contexts which have a reference count acquired
      already.

      When dst_entry:__refcnt is incremented or decremented via an atomic
      operation these read accesses stall and contribute to the performance
      problem.

   2) atomic_inc_not_zero()

      A reference on dst_entry:__refcnt is acquired via
      atomic_inc_not_zero() and released via atomic_dec_return().

      atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop,
      which exposes O(N^2) behaviour under contention with N concurrent
      operations. Contention scalability is degrading with even a small
      amount of contenders and gets worse from there.

      Lightweight instrumentation exposed an average of 8!! retry loops per
      atomic_inc_not_zero() invocation in a inc()/dec() loop running
      concurrently on 112 CPUs.

      There is nothing which can be done to make atomic_inc_not_zero() more
      scalable.

The following series addresses these issues:

    1) Reorder and pad struct dst_entry to prevent the false sharing.

    2) Implement and use a reference count implementation which avoids the
       atomic_inc_not_zero() problem.

       It is slightly less performant in the case of the final 0 -> -1
       transition, but the deconstruction of these objects is a low
       frequency event. get()/put() pairs are in the hotpath and that's
       what this implementation optimizes for.

       The algorithm of this reference count is only suitable for RCU
       managed objects. Therefore it cannot replace the refcount_t
       algorithm, which is also based on atomic_inc_not_zero(), due to a
       subtle race condition related to the 0 -> -1 transition and the final
       verdict to mark the reference count dead. See details in patch 2/3.

       It might be just my lack of imagination which declares this to be
       impossible and I'd be happy to be proven wrong.

       As a bonus the new rcuref implementation provides underflow/overflow
       detection and mitigation while being performance wise on par with
       open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in
       the non-contended case.

The combination of these two changes results in performance gains in micro
benchmarks and also localhost and networked memtier benchmarks talking to
memcached. It's hard to quantify the benchmark results as they depend
heavily on the micro-architecture and the number of concurrent operations.

The overall gain of both changes for localhost memtier ranges from 1.2X to
3.2X and from +2% to %5% range for networked operations on a 1Gb connection.

A micro benchmark which enforces maximized concurrency shows a gain between
1.2X and 4.7X!!!

Obviously this is focussed on a particular problem and therefore needs to
be discussed in detail. It also requires wider testing outside of the cases
which this is focussed on.

Though the false sharing issue is obvious and should be addressed
independent of the more focussed reference count changes.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rcuref

Changes vs. V2:

  - Rename __refcnt to __rcuref (Linus)

  - Fix comments and changelogs (Mark, Qiuxu)

  - Fixup kernel doc of generated atomic_add_negative() variants

I want to say thanks to Wangyang who analyzed the issue and provided the
initial fix for the false sharing problem. Further thanks go to Arjan
Peter, Marc, Will and Borislav for valuable input and providing test
results on machines which I do not have access to, and to Linus and
Eric, Qiuxu and Mark for helpful feedback.
====================

Link: https://lore.kernel.org/r/20230323102649.764958589@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2600badf

net: dst: Switch to rcuref_t reference counting · bc9d3a9f

Thomas Gleixner authored Mar 23, 2023

Under high contention dst_entry::__refcnt becomes a significant bottleneck.

atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.

Switch the reference count to rcuref_t which results in a significant
performance gain. Rename the reference count member to __rcuref to reflect
the change.

The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.

Running the memtier/memcached benchmark over a real (1Gb) network
connection the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.
Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de
Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bc9d3a9f

net: dst: Prevent false sharing vs. dst_entry:: __refcnt · d288a162

Wangyang Guo authored Mar 23, 2023

dst_entry::__refcnt is highly contended in scenarios where many connections
happen from and to the same IP. The reference count is an atomic_t, so the
reference count operations have to take the cache-line exclusive.

Aside of the unavoidable reference count contention there is another
significant problem which is caused by that: False sharing.

perf top identified two affected read accesses. dst_entry::lwtstate and
rtable::rt_genid.

dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into
a seperate cacheline vs. the read mostly members located at the beginning
of the struct.

That prevents false sharing vs. the struct members in the first 64
bytes of the structure, but there is also

  dst_entry::lwtstate

which is located after the reference count and in the same cache line. This
member is read after a reference count has been acquired.

struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a
size of 112 bytes, which means that the struct members of rtable which
follow the dst member share the same cache line as dst_entry::__refcnt.
Especially

  rtable::rt_genid

is also read by the contexts which have a reference count acquired
already.

When dst_entry:__refcnt is incremented or decremented via an atomic
operation these read accesses stall. This was found when analysing the
memtier benchmark in 1:100 mode, which amplifies the problem extremly.

Move the rt[6i]_uncached[_list] members out of struct rtable and struct
rt6_info into struct dst_entry to provide padding and move the lwtstate
member after that so it ends up in the same cache line.

The resulting improvement depends on the micro-architecture and the number
of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
benchmark.

[ tglx: Rearrange struct ]
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230323102800.042297517@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d288a162

Merge branch 'locking/rcuref' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b133fffe

Jakub Kicinski authored Mar 28, 2023

Pulling rcurefs from Peter for tglx's work.

Link: https://lore.kernel.org/all/20230328084534.GE4253@hirez.programming.kicks-ass.net/Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b133fffe

28 Mar, 2023 19 commits

net/mlx5e: Fix build break on 32bit · 163c2c70

Saeed Mahameed authored Mar 28, 2023

The cited commit caused the following build break in mlx5 due to a change
in size of MAX_SKB_FRAGS.

error: format '%lu' expects argument of type 'long unsigned int',
but argument 7 has type 'unsigned int' [-Werror=format=]

Fix this by explicit casting.

Fixes: 3948b059 ("net: introduce a config option to tweak MAX_SKB_FRAGS")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20230328200723.125122-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

163c2c70

net: ethernet: ti: am65-cpsw: enable p0 host port rx_vlan_remap · 86e2eca4

Grygorii Strashko authored Mar 27, 2023

By default, the tagged ingress packets to the switch from the host port
P0 get internal switch priority assigned equal to the DMA CPPI channel
number they came from, unless CPSW_P0_CONTROL_REG.RX_REMAP_VLAN is enabled.
This causes issues with applying QoS policies and mapping packets on
external port fifos, because the default configuration is vlan_aware and
DMA CPPI channels are shared between all external ports.

Hence enable CPSW_P0_CONTROL_REG.RX_REMAP_VLAN so packet will preserve
internal switch priority assigned following the VLAN(priority) tag no
matter through which DMA CPPI Channels packets enter the switch.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://lore.kernel.org/r/20230327092103.3256118-1-s-vadapalli@ti.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

86e2eca4

net: ethernet: ti: am65-cpsw: add .ndo to set dma per-queue rate · 5c8560c4

Grygorii Strashko authored Mar 27, 2023

Enable rate limiting TX DMA queues for CPSW interface by configuring the
rate in absolute Mb/s units per TX queue.

Example:
    ethtool -L eth0 tx 4

    echo 100 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
    echo 200 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
    echo 50 > /sys/class/net/eth0/queues/tx-2/tx_maxrate
    echo 30 > /sys/class/net/eth0/queues/tx-3/tx_maxrate

    # disable
    echo 0 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://lore.kernel.org/r/20230327085758.3237155-1-s-vadapalli@ti.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

5c8560c4

Merge branch 'allocate-multiple-skbuffs-on-tx' · d8b0c963

Paolo Abeni authored Mar 28, 2023

Arseniy Krasnov says:

====================
allocate multiple skbuffs on tx

This adds small optimization for tx path: instead of allocating single
skbuff on every call to transport, allocate multiple skbuff's until
credit space allows, thus trying to send as much as possible data without
return to af_vsock.c.

Also this patchset includes second patch which adds check and return from
'virtio_transport_get_credit()' and 'virtio_transport_put_credit()' when
these functions are called with 0 argument. This is needed, because zero
argument makes both functions to behave as no-effect, but both of them
always tries to acquire spinlock. Moreover, first patch always calls
function 'virtio_transport_put_credit()' with zero argument in case of
successful packet transmission.
====================

Link: https://lore.kernel.org/r/b0d15942-65ba-3a32-ba8d-fed64332d8f6@sberdevices.ruSigned-off-by: Paolo Abeni <pabeni@redhat.com>

d8b0c963

virtio/vsock: check argument to avoid no effect call · e3ec366e

Arseniy Krasnov authored Mar 26, 2023

Both of these functions have no effect when input argument is 0, so to
avoid useless spinlock access, check argument before it.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

e3ec366e

virtio/vsock: allocate multiple skbuffs on tx · b68ffb1b

Arseniy Krasnov authored Mar 26, 2023

This adds small optimization for tx path: instead of allocating single
skbuff on every call to transport, allocate multiple skbuff's until
credit space allows, thus trying to send as much as possible data without
return to af_vsock.c.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

b68ffb1b

atomics: Provide rcuref - scalable reference counting · ee1ee6db

Thomas Gleixner authored Mar 23, 2023

atomic_t based reference counting, including refcount_t, uses
atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
implemented with a atomic_try_cmpxchg() loop. High contention of the
reference count leads to retry loops and scales badly. There is nothing to
improve on this implementation as the semantics have to be preserved.

Provide rcuref as a scalable alternative solution which is suitable for RCU
managed objects. Similar to refcount_t it comes with overflow and underflow
detection and mitigation.

rcuref treats the underlying atomic_t as an unsigned integer and partitions
this space into zones:

  0x00000000 - 0x7FFFFFFF	valid zone (1 .. (INT_MAX + 1) references)
  0x80000000 - 0xBFFFFFFF	saturation zone
  0xC0000000 - 0xFFFFFFFE	dead zone
  0xFFFFFFFF   			no reference

rcuref_get() unconditionally increments the reference count with
atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
reference count with atomic_add_negative_release().

This unconditional increment avoids the inc_not_zero() problem, but
requires a more complex implementation on the put() side when the count
drops from 0 to -1.

When this transition is detected then it is attempted to mark the reference
count dead, by setting it to the midpoint of the dead zone with a single
atomic_cmpxchg_release() operation. This operation can fail due to a
concurrent rcuref_get() elevating the reference count from -1 to 0 again.

If the unconditional increment in rcuref_get() hits a reference count which
is marked dead (or saturated) it will detect it after the fact and bring
back the reference count to the midpoint of the respective zone. The zones
provide enough tolerance which makes it practically impossible to escape
from a zone.

The racy implementation of rcuref_put() requires to protect rcuref_put()
against a grace period ending in order to prevent a subtle use after
free. As RCU is the only mechanism which allows to protect against that, it
is not possible to fully replace the atomic_inc_not_zero() based
implementation of refcount_t with this scheme.

The final drop is slightly more expensive than the atomic_dec_return()
counterpart, but that's not the case which this is optimized for. The
optimization is on the high frequeunt get()/put() pairs and their
scalability.

The performance of an uncontended rcuref_get()/put() pair where the put()
is not dropping the last reference is still on par with the plain atomic
operations, while at the same time providing overflow and underflow
detection and mitigation.

The performance of rcuref compared to plain atomic_inc_not_zero() and
atomic_dec_return() based reference counting under contention:

 -  Micro benchmark: All CPUs running a increment/decrement loop on an
    elevated reference count, which means the 0 to -1 transition never
    happens.

    The performance gain depends on microarchitecture and the number of
    CPUs and has been observed in the range of 1.3X to 4.7X

 - Conversion of dst_entry::__refcnt to rcuref and testing with the
    localhost memtier/memcached benchmark. That benchmark shows the
    reference count contention prominently.

    The performance gain depends on microarchitecture and the number of
    CPUs and has been observed in the range of 1.1X to 2.6X over the
    previous fix for the false sharing issue vs. struct
    dst_entry::__refcnt.

    When memtier is run over a real 1Gb network connection, there is a
    small gain on top of the false sharing fix. The two changes combined
    result in a 2%-5% total gain for that networked test.
Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230323102800.158429195@linutronix.de

ee1ee6db

atomics: Provide atomic_add_negative() variants · e5ab9eff

Thomas Gleixner authored Mar 23, 2023

atomic_add_negative() does not provide the relaxed/acquire/release
variants.

Provide them in preparation for a new scalable reference count algorithm.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230323102800.101763813@linutronix.de

e5ab9eff

Merge tag 'linux-can-next-for-6.4-20230327' of... · 4cee0fb9

Jakub Kicinski authored Mar 27, 2023

Merge tag 'linux-can-next-for-6.4-20230327' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2023-03-27

The first 2 patches by Geert Uytterhoeven add transceiver support and
improve the error messages in the rcar_canfd driver.

Cai Huoqing contributes 3 patches which remove a redundant call to
pci_clear_master() in the c_can, ctucanfd and kvaser_pciefd driver.

Frank Jungclaus's patch replaces the struct esd_usb_msg with a union
in the esd_usb driver to improve readability.

Markus Schneider-Pargmann contributes 5 patches to improve the
performance in the m_can driver, especially for SPI attached
controllers like the tcan4x5x.

* tag 'linux-can-next-for-6.4-20230327' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: m_can: Keep interrupts enabled during peripheral read
  can: m_can: Disable unused interrupts
  can: m_can: Remove double interrupt enable
  can: m_can: Always acknowledge all interrupts
  can: m_can: Remove repeated check for is_peripheral
  can: esd_usb: Improve code readability by means of replacing struct esd_usb_msg with a union
  can: kvaser_pciefd: Remove redundant pci_clear_master
  can: ctucanfd: Remove redundant pci_clear_master
  can: c_can: Remove redundant pci_clear_master
  can: rcar_canfd: Improve error messages
  can: rcar_canfd: Add transceiver support
====================

Link: https://lore.kernel.org/r/20230327073354.1003134-1-mkl@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4cee0fb9

Merge branch 'add-tx-push-buf-len-param-to-ethtool' · da954ae1

Jakub Kicinski authored Mar 27, 2023

Shay Agroskin says:

====================
Add tx push buf len param to ethtool

This patchset adds a new sub-configuration to ethtool get/set queue
params (ethtool -g) called 'tx-push-buf-len'.

This configuration specifies the maximum number of bytes of a
transmitted packet a driver can push directly to the underlying
device ('push' mode). The motivation for pushing some of the bytes to
the device has the advantages of

- Allowing a smart device to take fast actions based on the packet's
  header
- Reducing latency for small packets that can be copied completely into
  the device

This new param is practically similar to tx-copybreak value that can be
set using ethtool's tunable but conceptually serves a different purpose.
While tx-copybreak is used to reduce the overhead of DMA mapping and
makes no sense to use if less than the whole segment gets copied,
tx-push-buf-len allows to improve performance by analyzing the packet's
data (usually headers) before performing the DMA operation.

The configuration can be queried and set using the commands:

    $ ethtool -g [interface]

    # ethtool -G [interface] tx-push-buf-len [number of bytes]

This patchset also adds support for the new configuration in ENA driver
for which this parameter ensures efficient resources management on the
device side.
====================

Link: https://lore.kernel.org/r/20230323163610.1281468-1-shayagr@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

da954ae1

net: ena: Advertise TX push support · 060cdac2

Shay Agroskin authored Mar 23, 2023

LLQ is auto enabled by the device and disabling it isn't supported on
new ENA generations while on old ones causes sub-optimal performance.

This patch adds advertisement of push-mode when LLQ mode is used, but
rejects an attempt to modify it.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

060cdac2

net: ena: Add support to changing tx_push_buf_len · b0c59e53

Shay Agroskin authored Mar 23, 2023

The ENA driver allows for two distinct values for the number of bytes
of the packet's payload that can be written directly to the device.

For a value of 224 the driver turns on Large LLQ Header mode in which
the first 224 of the packet's payload are written to the LLQ.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b0c59e53

net: ena: Recalculate TX state variables every device reset · a416cb25

Shay Agroskin authored Mar 23, 2023

With the ability to modify LLQ entry size, the size of packet's
payload that can be written directly to the device changes.
This patch makes the driver recalculate this information every device
negotiation (also called device reset).
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a416cb25

net: ena: Add an option to configure large LLQ headers · 1e366688

David Arinzon authored Mar 23, 2023

Allow configuring the device with large LLQ headers. The Low Latency
Queue (LLQ) allows the driver to write the first N bytes of the packet,
along with the rest of the TX descriptors directly into device (N can be
either 96 or 224 for large LLQ headers configuration).

Having L4 TCP/UDP headers contained in the first 96 bytes of the packet
is required to get maximum performance from the device.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1e366688

net: ena: Make few cosmetic preparations to support large LLQ · 3a091084

Shay Agroskin authored Mar 23, 2023

Move ena_calc_io_queue_size() implementation closer to the file's
beginning so that it can be later called from ena_device_init()
function without adding a function declaration.

Also add an empty line at some spots to separate logical blocks in
funcitons.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3a091084

ethtool: Add support for configuring tx_push_buf_len · 233eb4e7

Shay Agroskin authored Mar 23, 2023

This attribute, which is part of ethtool's ring param configuration
allows the user to specify the maximum number of the packet's payload
that can be written directly to the device.

Example usage:
    # ethtool -G [interface] tx-push-buf-len [number of bytes]
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

233eb4e7

netlink: Add a macro to set policy message with format string · 3e4d5ba9

Shay Agroskin authored Mar 23, 2023

Similar to NL_SET_ERR_MSG_FMT, add a macro which sets netlink policy
error message with a format string.
Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3e4d5ba9

qed: remove unused num_ooo_add_to_peninsula variable · 2bcc74ff

Tom Rix authored Mar 25, 2023

clang with W=1 reports
drivers/net/ethernet/qlogic/qed/qed_ll2.c:649:6: error: variable
  'num_ooo_add_to_peninsula' set but not used [-Werror,-Wunused-but-set-variable]
        u32 num_ooo_add_to_peninsula = 0, cid;
            ^
This variable is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230326001733.1343274-1-trix@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2bcc74ff

net: introduce a config option to tweak MAX_SKB_FRAGS · 3948b059

Eric Dumazet authored Mar 23, 2023

Currently, MAX_SKB_FRAGS value is 17.

For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
attempts order-3 allocations, stuffing 32768 bytes per frag.

But with zero copy, we use order-0 pages.

For BIG TCP to show its full potential, we add a config option
to be able to fit up to 45 segments per skb.

This is also needed for BIG TCP rx zerocopy, as zerocopy currently
does not support skbs with frag list.

We have used MAX_SKB_FRAGS=45 value for years at Google before
we deployed 4K MTU, with no adverse effect, other than
a recent issue in mlx4, fixed in commit 26782aad
("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS")

Back then, goal was to be able to receive full size (64KB) GRO
packets without the frag_list overhead.

Note that /proc/sys/net/core/max_skb_frags can also be used to limit
the number of fragments TCP can use in tx packets.

By default we keep the old/legacy value of 17 until we get
more coverage for the updated values.

Sizes of struct skb_shared_info on 64bit arches

MAX_SKB_FRAGS | sizeof(struct skb_shared_info):
==============================================
         17     320
         21     320+64  = 384
         25     320+128 = 448
         29     320+192 = 512
         33     320+256 = 576
         37     320+320 = 640
         41     320+384 = 704
         45     320+448 = 768

This inflation might cause problems for drivers assuming they could pack
both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
using build_skb().

v3: fix build error when CONFIG_NET=n
v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20230323162842.1935061-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3948b059

27 Mar, 2023 12 commits

dev_ioctl: fix a W=1 warning · e5b42483

Heiner Kallweit authored Mar 24, 2023

This fixes the following warning when compiled with GCC 12.2.0 and W=1.

net/core/dev_ioctl.c:475: warning: Function parameter or member 'data'
not described in 'dev_ioctl'
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e5b42483

net: phy: bcm7xxx: use devm_clk_get_optional_enabled to simplify the code · 4228c3a2

Heiner Kallweit authored Mar 24, 2023

Use devm_clk_get_optional_enabled to simplify the code.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4228c3a2

tools: ynl: default to treating enums as flags for mask generation · 4c6170d1

Jakub Kicinski authored Mar 24, 2023

I was a bit too optimistic in commit bf51d277 ("tools: ynl: fix
get_mask utility routine"), not every mask we use is necessarily
coming from an enum of type "flags". We also allow flipping an
enum into flags on per-attribute basis. That's done by
the 'enum-as-flags' property of an attribute.

Restore this functionality, it's not currently used by any in-tree
family.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c6170d1

selftests: tls: add a test for queuing data before setting the ULP · a504d246

Jakub Kicinski authored Mar 24, 2023

Other tests set up the connection fully on both ends before
communicating any data. Add a test which will queue up TLS
records to TCP before the TLS ULP is installed.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a504d246

tools: ynl: Add missing types to encode/decode · dd3a7d58

Michal Michalik authored Mar 24, 2023

While testing the tool I noticed we miss the u16 type on payload create.
On the code inspection it turned out we miss also u64 - add them.

We also miss the decoding of u16 despite the fact `NlAttr` class
supports it - add it.
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

dd3a7d58

Merge branch 'sunhme-cleanups' · fe5b9907

David S. Miller authored Mar 27, 2023

Sean Anderson says:

====================
net: sunhme: Probe/IRQ cleanups

Well, I've had these patches kicking around in my tree since last
October, so I guess I had better get around to posting them. This
series is mainly a cleanup/consolidation of the probe process, with
some interrupt changes as well.  Some of these changes are SBUS- (AKA
SPARC-) specific, so this should really get some testing there as well
to ensure nothing breaks. I've CC'd a few SPARC mailing lists in hopes
that someone there can try this out. I also have an SBUS card I
ordered by mistake if anyone has a SPARC computer but lacks this card.

Changes in v4:
- Tweak variable order for yuletide
- Move uninitialized return to its own commit
- Use correct SBUS/PCI accessors
- Rework hme_version to set the default in pci/sbus_probe and override it (if
  necessary) in common_probe

Changes in v3:
- Incorperate a fix from another series into this commit

Changes in v2:
- Move happy_meal_begin_auto_negotiation earlier and remove forward declaration
- Make some more includes common
- Clean up mac address init
- Inline error returns
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

fe5b9907

net: sunhme: Consolidate common probe tasks · ecdcd042

Sean Anderson authored Mar 24, 2023

Most of the second half of the PCI/SBUS probe functions are the same.
Consolidate them into a common function.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ecdcd042

net: sunhme: Inline error returns · 902fe6e9

Sean Anderson authored Mar 24, 2023

The err_out label used to have cleanup. Now that it just returns, inline it
everywhere.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

902fe6e9

net: sunhme: Clean up mac address init · d1f08819

Sean Anderson authored Mar 24, 2023

Clean up some oddities suggested during review.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d1f08819

net: sunhme: Consolidate mac address initialization · 273fb669

Sean Anderson authored Mar 24, 2023

The mac address initialization is braodly the same between PCI and SBUS,
and one was clearly copied from the other. Consolidate them. We still have
to have some ifdefs because pci_(un)map_rom is only implemented for PCI,
and idprom is only implemented for SPARC.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

273fb669

net: sunhme: Switch SBUS to devres · cc216e4b

Sean Anderson authored Mar 24, 2023

The PCI half of this driver was converted in commit 914d9b27 ("sunhme:
switch to devres"). Do the same for the SBUS half.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cc216e4b

net: sunhme: Alphabetize includes · 1ff4f42a

Sean Anderson authored Mar 24, 2023

Alphabetize includes to make it clearer where to add new ones.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1ff4f42a