Commits · 750771d0ca76817e15fef1211b9748ae7ed3aff6 · Kirill Smelkov / linux

06 May, 2024 11 commits

gtp: prepare for IPv6 support · 750771d0

Pablo Neira Ayuso authored May 07, 2024

Use union artifact to prepare for IPv6 support.
Add and use GTP_{IPV4,TH}_MAXLEN.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

gtp: properly parse extension headers · b6fc0956

Pablo Neira Ayuso authored May 07, 2024

Currently GTP packets are dropped if the next extension field is set to
a non-zero value, but these are valid GTP packets.

TS 29.281 provides a longer header format, which is defined as struct
gtp1_header_long. Such long header format is used if any of the S, PN, E
flags is set.

This long header is 4 bytes longer than struct gtp1_header, plus
variable length (optional) extension headers. The next extension header
field is zero if no extension header is provided.

The extension header is composed of a length field (1 byte), which gives
the total size in 4-byte words including the length field itself, a
payload (variable length) and a next type field (1 byte). The extension
header, including its payload, is aligned to 4 bytes.

A GTP packet might come with a chain of extension headers, which makes it
slightly cumbersome to parse because the extension next header field
comes at the end of the extension header, and there is a need to check
if this field becomes zero to stop the extension header parser.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
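
The extension header walk described above can be sketched as follows. This is a minimal illustration in Python, not the driver code: the function name gtp1_payload_offset is hypothetical, the flag bit values and the 8/12 byte header sizes are assumptions taken from the description of the short and long header formats.

```python
# Assumed flag bits of the first GTPv1 header byte (PN, S, E).
GTP1_F_NPDU = 0x01
GTP1_F_SEQ = 0x02
GTP1_F_EXTHDR = 0x04
GTP1_F_MASK = GTP1_F_NPDU | GTP1_F_SEQ | GTP1_F_EXTHDR

GTP1_HDR_LEN = 8        # short header
GTP1_HDR_LONG_LEN = 12  # short header + 4 bytes, last byte is the next type


def gtp1_payload_offset(pkt: bytes) -> int:
    """Return the offset of the inner payload (hypothetical helper)."""
    if len(pkt) < GTP1_HDR_LEN:
        raise ValueError("truncated GTP header")
    if not pkt[0] & GTP1_F_MASK:
        return GTP1_HDR_LEN  # none of S, PN, E set: short header

    if len(pkt) < GTP1_HDR_LONG_LEN:
        raise ValueError("truncated long GTP header")
    offset = GTP1_HDR_LONG_LEN
    next_type = pkt[offset - 1]  # last byte of the long header

    # Walk the chain until an extension reports next type zero.
    while next_type != 0:
        if offset >= len(pkt):
            raise ValueError("truncated extension header")
        ext_len = pkt[offset] * 4  # length field counts 4-byte words
        if ext_len == 0 or offset + ext_len > len(pkt):
            raise ValueError("bad extension header length")
        offset += ext_len
        next_type = pkt[offset - 1]  # next type is the last byte of the extension
    return offset
```

Because the next type field sits at the end of each extension, the loop has to read the length first and only then learns whether another extension follows.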

gtp: remove useless initialization · 353f5ffb

Pablo Neira Ayuso authored May 07, 2024

Update b20dc3c6 ("gtp: Allow to create GTP device without FDs") to
remove useless initialization to NULL; sockets are initialized to
non-NULL just a few lines of code after this.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge branch 'add-tcp-fraglist-gro-support' · 8c4e4798

Paolo Abeni authored May 06, 2024

Felix Fietkau says:

====================
Add TCP fraglist GRO support

When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.

When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established
socket in the same netns as the receiving device. While this may not
cover all relevant use cases in multi-netns configurations, it should be
good enough for most configurations that need this.

Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.

rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on:  770 Mbit/s, CPU 40% idle

Changes since v4:
 - add likely() to prefer the non-fraglist path in check

Changes since v3:
 - optimize __tcpv4_gso_segment_csum
 - add unlikely()
 - reorder dev_net/skb_gro_network_header calls after NETIF_F_GRO_FRAGLIST
   check
 - add support for ipv6 nat
 - drop redundant pskb_may_pull check

Changes since v2:
 - create tcp_gro_header_pull helper function to pull tcp header only once
 - optimize __tcpv4_gso_segment_list_csum, drop obsolete flags check

Changes since v1:
 - revert bogus tcp flags overwrite on segmentation
 - fix kbuild issue with !CONFIG_IPV6
 - only perform socket lookup for the first skb in the GRO train

Changes since RFC:
 - split up patches
 - handle TCP flags mutations
====================

Link: https://lore.kernel.org/r/20240502084450.44009-1-nbd@nbd.name
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: add heuristic for enabling TCP fraglist GRO · c9d1d23e

Felix Fietkau authored May 02, 2024

When forwarding TCP after GRO, software segmentation is very expensive,
especially when the checksum needs to be recalculated.
One case where that's currently unavoidable is when routing packets over
PPPoE. Performance improves significantly when using fraglist GRO
implemented in the same way as for UDP.

When NETIF_F_GRO_FRAGLIST is enabled, perform a lookup for an established
socket in the same netns as the receiving device. While this may not
cover all relevant use cases in multi-netns configurations, it should be
good enough for most configurations that need this.

Here's a measurement of running 2 TCP streams through a MediaTek MT7622
device (2-core Cortex-A53), which runs NAT with flow offload enabled from
one ethernet port to PPPoE on another ethernet port + cake qdisc set to
1Gbps.

rx-gro-list off: 630 Mbit/s, CPU 35% idle
rx-gro-list on:  770 Mbit/s, CPU 40% idle
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
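
One way to read the heuristic above is as a two-step decision. The sketch below is a toy model, not the kernel code: use_fraglist_gro is a hypothetical name, the feature bit value is a placeholder, and the rule "no established local socket means the flow is probably being forwarded" is an assumption drawn from the forwarding motivation in the message.

```python
NETIF_F_GRO_FRAGLIST = 1 << 0  # placeholder bit, not the kernel's real value


def use_fraglist_gro(dev_features: int, local_socket_found: bool) -> bool:
    """Decide between fraglist GRO and regular GRO (hypothetical helper)."""
    if not dev_features & NETIF_F_GRO_FRAGLIST:
        return False  # feature disabled: always regular GRO
    # Assumption: a flow without an established socket in the receiving
    # device's netns is likely forwarded, where fraglist GRO avoids the
    # expensive software segmentation.
    return not local_socket_found
```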

net: create tcp_gro_header_pull helper function · 7516b27c

Felix Fietkau authored May 02, 2024

Pull the code out of tcp_gro_receive in order to access the tcp header
from tcp4/6_gro_receive.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: create tcp_gro_lookup helper function · 80e85fbd

Felix Fietkau authored May 02, 2024

This pulls the flow port matching out of tcp_gro_receive, so that it can be
reused for the next change, which adds the TCP fraglist GRO heuristic.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: add code for TCP fraglist GRO · 8d95dc47

Felix Fietkau authored May 02, 2024

This implements fraglist GRO similar to how it's handled in UDP, however
no functional changes are added yet. The next change adds a heuristic for
using fraglist GRO instead of regular GRO.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: add support for segmenting TCP fraglist GSO packets · bee88cd5

Felix Fietkau authored May 02, 2024

Preparation for adding TCP fraglist GRO support. It expects packets to be
combined in a similar way as UDP fraglist GSO packets.
For IPv4 packets, NAT is handled in the same way as UDP fraglist GSO.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: move skb_gro_receive_list from udp to core · 8928756d

Felix Fietkau authored May 02, 2024

This helper function will be used for TCP fraglist GRO support
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: microchip: lan743x: Reduce PTP timeout on HW failure · b1de3c0d

Rengarajan S authored May 02, 2024

The PTP_CMD_CTL is a self-clearing register which controls the PTP clock
values. In the current implementation the driver waits up to 20 sec for
the PTP_CMD_CTL register bit to clear in case of HW failure. A 20 sec
timeout is far too long to recognize a HW failure, as the bit is
typically cleared in one clock (<16ns). Hence reducing the timeout to 1 sec
is sufficient to conclude that a HW failure has occurred. The
usleep_range will sleep somewhere between 1 msec and 20 msec for each
iteration. By setting PTP_CMD_CTL_TIMEOUT_CNT to 50 the max timeout
becomes 1 sec.
Signed-off-by: Rengarajan S <rengarajan.s@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240502050300.38689-1-rengarajan.s@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
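
The timeout arithmetic above can be checked with a small model of the polling loop. This is a sketch in Python under stated assumptions, not the driver: wait_till_cmd_done and worst_case_timeout_us are hypothetical names, and only the count of 50 and the 1 to 20 msec sleep range come from the message.

```python
import time

PTP_CMD_CTL_TIMEOUT_CNT = 50  # value named in the commit message
SLEEP_MIN_US = 1_000          # lower bound of the sleep range
SLEEP_MAX_US = 20_000         # upper bound of the sleep range


def wait_till_cmd_done(bit_is_set, timeout_cnt=PTP_CMD_CTL_TIMEOUT_CNT,
                       sleep=time.sleep):
    """Poll until the self-clearing bit drops or the iteration budget runs out."""
    for _ in range(timeout_cnt):
        if not bit_is_set():
            return True
        sleep(SLEEP_MIN_US / 1_000_000)
    return not bit_is_set()


def worst_case_timeout_us(timeout_cnt=PTP_CMD_CTL_TIMEOUT_CNT):
    """Upper bound on the wait if every sleep takes the maximum time."""
    return timeout_cnt * SLEEP_MAX_US
```

With 50 iterations of at most 20 msec each, the upper bound is 1,000,000 usec, which matches the 1 sec figure.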

05 May, 2024 10 commits

Merge branch 'gve-queue-api' · cdc74c9d

David S. Miller authored May 05, 2024

Shailend Chand says:

====================
gve: Implement queue api

Following the discussion on
https://patchwork.kernel.org/project/linux-media/patch/20240305020153.2787423-2-almasrymina@google.com/,
the queue api defined by Mina is implemented for gve.

The first patch is just Mina's introduction of the api. The rest of the
patches make surgical changes in gve to enable it to work correctly with
only a subset of queues present (thus far it had assumed that either all
queues are up or all are down). The final patch has the api
implementation.

Changes since v1: clang warning fixes, kdoc warning fix, and addressed
review comments.
====================
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Alloc and free QPLs with the rings · ee24284e

Shailend Chand authored May 01, 2024

Every tx and rx ring has its own queue-page-list (QPL) that serves as
the bounce buffer. Previously we were allocating QPLs for all queues
before the queues themselves were allocated and later associating a QPL
with a queue. This is avoidable complexity: it is much more natural for
each queue to allocate and free its own QPL.

Moreover, the advent of new queue-manipulating ndo hooks makes it hard to
keep things as is: we would need to transfer a QPL from an old queue to
a new queue, and that is unpleasant.
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Account for stopped queues when reading NIC stats · af9bcf91

Shailend Chand authored May 01, 2024

We now account for the fact that the NIC might send us stats for a
subset of queues. Without this change, gve_get_ethtool_stats might make
an invalid access on the priv->stats_report->stats array.
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Reset Rx ring state in the ring-stop funcs · 770f52d5

Shailend Chand authored May 01, 2024

This does not fix any existing bug. In anticipation of the ndo queue api
hooks that alloc/free/start/stop a single Rx queue, the already existing
per-queue stop functions are being made more robust. Specifically for
this use case: rx_queue_n.stop() + rx_queue_n.start()

Note that this is not the use case being used in devmem tcp (the first
place these new ndo hooks would be used). There the use case is:
new_queue.alloc() + old_queue.stop() + new_queue.start() + old_queue.free()
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
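
The two call sequences named above differ in which queue object each step touches. The following Python sketch only records call order; RxQueue, restart_in_place and swap_queue are hypothetical names and do not correspond to gve or ndo symbols.

```python
class RxQueue:
    """Toy queue that logs which operation ran on which queue."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def alloc(self):
        self.log.append(f"{self.name}.alloc")

    def start(self):
        self.log.append(f"{self.name}.start")

    def stop(self):
        self.log.append(f"{self.name}.stop")

    def free(self):
        self.log.append(f"{self.name}.free")


def restart_in_place(queue):
    # The case this commit hardens: stop() must leave the ring in a state
    # that a later start() on the same queue can use.
    queue.stop()
    queue.start()


def swap_queue(old, new):
    # The devmem TCP sequence: the new queue is fully allocated before the
    # old one is stopped, so stop() never has to be followed by a start()
    # on the same ring.
    new.alloc()
    old.stop()
    new.start()
    old.free()
```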

gve: Avoid rescheduling napi if on wrong cpu · 9a5e0776

Shailend Chand authored May 01, 2024

In order to make possible the implementation of per-queue ndo hooks,
gve_turnup was changed in a previous patch to account for queues already
having some unprocessed descriptors: it does a one-off napi_schedule to
handle them. If conditions of consistent high traffic persist in the
immediate aftermath of this, the poll routine for a queue can be "stuck"
on the cpu on which the ndo hooks ran, instead of the cpu its irq has
affinity with.

This situation is exacerbated by the fact that the ndo hooks for all the
queues are invoked on the same cpu, potentially causing all the napi
poll routines to be residing on the same cpu.

A self correcting mechanism in the poll method itself solves this
problem.
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Make gve_turnup work for nonempty queues · 864616d9

Shailend Chand authored May 01, 2024

gVNIC has a requirement that all queues have to be quiesced before any
queue is operated on (created or destroyed). To enable the
implementation of future ndo hooks that work on a single queue, we need
to evolve gve_turnup to account for queues already having some
unprocessed descriptors in the ring.

Say rxq 4 is being stopped and started via the queue api. Due to gve's
requirement of quiescence, queues 0 through 3 are not processing their
rings while queue 4 is being toggled. Once they are made live, these
queues need to be poked to cause them to check their rings for
descriptors that were written during their brief period of quiescence.
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Make gve_turn(up|down) ignore stopped queues · 5abc37bd

Shailend Chand authored May 01, 2024

Currently the queues are either all live or all dead, toggling from one
state to the other via the ndo open and stop hooks. The future addition
of single-queue ndo hooks changes this, and thus gve_turnup and
gve_turndown should evolve to account for a state where some queues are
live and some aren't.
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Add adminq funcs to add/remove a single Rx queue · 242f30fe

Shailend Chand authored May 01, 2024

This allows for implementing future ndo hooks that act on a single
queue.
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gve: Make the GQ RX free queue funcs idempotent · dcecfcf2

Shailend Chand authored May 01, 2024

Although this is not fixing any existing double free bug, making these
functions idempotent allows for a simpler implementation of future ndo
hooks that act on a single queue.
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

queue_api: define queue api · 087b24de

Mina Almasry authored May 01, 2024

This API enables the net stack to reset the queues used for devmem TCP.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03 May, 2024 19 commits

Revert "net: mirror skb frag ref/unref helpers" · 173e7622

Mina Almasry authored May 02, 2024

This reverts commit a580ea99.

This revert is to resolve Dragos's report of page_pool leak here:
https://lore.kernel.org/lkml/20240424165646.1625690-2-dtatulea@nvidia.com/

The reverted patch interacts very badly with commit 2cc3aeb5 ("skbuff:
Fix a potential race while recycling page_pool packets"). The reverted
commit hopes that the pp_recycle + is_pp_page variables do not change
between the skb_frag_ref and skb_frag_unref operation. If such a change
occurs, the skb_frag_ref/unref will not operate on the same reference type.
In the case of Dragos's report, the grabbed ref was a pp ref, but the unref
was a page ref, because the pp_recycle setting on the skb was changed.

Attempting to fix this issue on the fly is risky. Let's revert and I hope
to reland this with better understanding and testing to ensure we don't
regress some edge case while streamlining skb reffing.

Fixes: a580ea99 ("net: mirror skb frag ref/unref helpers")
Reported-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240502175423.2456544-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
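
The mismatch described above can be reproduced with a toy model. This Python sketch is not the skbuff code: Page, Skb, frag_ref and frag_unref are hypothetical stand-ins, and the two counters only illustrate that the reference type is chosen from state that can change between the ref and the unref.

```python
from dataclasses import dataclass


@dataclass
class Page:
    is_pp_page: bool
    pp_refs: int = 0    # page_pool references
    page_refs: int = 0  # plain page references


@dataclass
class Skb:
    pp_recycle: bool


def frag_ref(skb: Skb, page: Page) -> None:
    # Reference type depends on skb and page state at call time.
    if skb.pp_recycle and page.is_pp_page:
        page.pp_refs += 1
    else:
        page.page_refs += 1


def frag_unref(skb: Skb, page: Page) -> None:
    # Mirrors frag_ref, but evaluates the same state again later.
    if skb.pp_recycle and page.is_pp_page:
        page.pp_refs -= 1
    else:
        page.page_refs -= 1


page = Page(is_pp_page=True, pp_refs=1, page_refs=1)
skb = Skb(pp_recycle=True)
frag_ref(skb, page)       # takes a pp ref
skb.pp_recycle = False    # setting changes between ref and unref
frag_unref(skb, page)     # drops a page ref instead
```

In this model the pp reference taken by frag_ref is never dropped, while an unrelated page reference is, which is the shape of the leak in the report.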

bnxt: fix bnxt_get_avail_msix() returning negative values · 5bfadc57

David Wei authored May 02, 2024

Current net-next/main does not boot for older chipsets e.g. Stratus.

Sample dmesg:
[   11.368315] bnxt_en 0000:02:00.0 (unnamed net_device) (uninitialized): Able to reserve only 0 out of 9 requested RX rings
[   11.390181] bnxt_en 0000:02:00.0 (unnamed net_device) (uninitialized): Unable to reserve tx rings
[   11.438780] bnxt_en 0000:02:00.0 (unnamed net_device) (uninitialized): 2nd rings reservation failed.
[   11.487559] bnxt_en 0000:02:00.0 (unnamed net_device) (uninitialized): Not enough rings available.
[   11.506012] bnxt_en 0000:02:00.0: probe with driver bnxt_en failed with error -12

This is caused by bnxt_get_avail_msix() returning a negative value for
these chipsets not using the new resource manager i.e. !BNXT_NEW_RM.
This in turn causes hwr.cp in __bnxt_reserve_rings() to be set to 0.

In the current call stack, __bnxt_reserve_rings() is called from
bnxt_set_dflt_rings() before bnxt_init_int_mode(). Therefore,
bp->total_irqs is always 0 and for !BNXT_NEW_RM bnxt_get_avail_msix()
always returns a negative number.

Historically, MSIX vectors were requested by the RoCE driver during
run-time and bnxt_get_avail_msix() was used for this purpose. Today,
RoCE MSIX vectors are statically allocated. bnxt_get_avail_msix() should
only be called for the BNXT_NEW_RM() case to reserve the MSIX ahead of
time for RoCE use.

bnxt_get_avail_msix() is also simplified to handle the BNXT_NEW_RM()
case only.

Fixes: d630624e ("bnxt_en: Utilize ulp client resources if RoCE is not registered")
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240502203757.3761827-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: no longer acquire RTNL in threaded_show() · c1742dcb

Eric Dumazet authored May 02, 2024

dev->threaded can be read locklessly, if we add
corresponding READ_ONCE()/WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240502173926.2010646-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: add --list-ops and --list-msgs to CLI · 3e51f2cb

Jakub Kicinski authored May 02, 2024

I often forget the exact naming of ops and have to look at
the spec to find it. Add support for listing the operations:

  $ ./cli.py --spec .../netdev.yaml --list-ops
  dev-get  [ do, dump ]
  page-pool-get  [ do, dump ]
  page-pool-stats-get  [ do, dump ]
  queue-get  [ do, dump ]
  napi-get  [ do, dump ]
  qstats-get  [ dump ]

For completeness also support listing all ops (including
notifications):

  # ./cli.py --spec .../netdev.yaml --list-msgs
  dev-get  [ dump, do ]
  dev-add-ntf  [ notify ]
  dev-del-ntf  [ notify ]
  dev-change-ntf  [ notify ]
  page-pool-get  [ dump, do ]
  page-pool-add-ntf  [ notify ]
  page-pool-del-ntf  [ notify ]
  page-pool-change-ntf  [ notify ]
  page-pool-stats-get  [ dump, do ]
  queue-get  [ dump, do ]
  napi-get  [ dump, do ]
  qstats-get  [ dump ]

Use double space after the name for slightly easier to read
output.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240502164043.2130184-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
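
The output format shown above (name, two spaces, bracketed mode list) could be produced along these lines. This is only a sketch of the formatting, not the cli.py implementation: format_ops is a hypothetical helper and the plain dict stands in for whatever the parsed spec actually provides.

```python
def format_ops(ops):
    """Render one line per operation: name, double space, list of modes."""
    lines = []
    for name, modes in ops.items():
        lines.append(f"{name}  [ {', '.join(modes)} ]")
    return lines


# Hypothetical input mirroring part of the sample output above.
sample = {
    "dev-get": ["do", "dump"],
    "qstats-get": ["dump"],
}
for line in format_ops(sample):
    print(line)
```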

Merge branch 'rtnetlink-rtnl_stats_dump-changes' · f3ad4914

Jakub Kicinski authored May 03, 2024

Eric Dumazet says:

====================
rtnetlink: rtnl_stats_dump() changes

Getting rid of RTNL in rtnl_stats_dump() looks challenging.

In the meantime, we can:

1) Avoid RTNL acquisition for the final NLMSG_DONE marker.

2) Use for_each_netdev_dump() instead of the net->dev_index_head[]
   hash table.
====================

Link: https://lore.kernel.org/r/20240502113748.1622637-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: use for_each_netdev_dump() in rtnl_stats_dump() · 0feb396f

Eric Dumazet authored May 02, 2024

Switch rtnl_stats_dump() to use for_each_netdev_dump()
instead of net->dev_index_head[] hash table.

This makes the code much easier to read, and fixes
scalability issues.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240502113748.1622637-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: change rtnl_stats_dump() return value · 136c2a9a

Eric Dumazet authored May 02, 2024

By returning 0 (or an error) instead of skb->len,
we allow NLMSG_DONE to be appended to the current
skb at the end of a dump, saving a couple of recvmsg()
system calls.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240502113748.1622637-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
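
Why the return value matters can be seen in a simplified model of a netlink dump. The Python sketch below is an assumption-laden toy, not the netlink core: dump_messages is a hypothetical name, each inner list stands for one skb returned by one recvmsg(), and room for the NLMSG_DONE marker is assumed to always be available.

```python
def dump_messages(items, per_skb, return_skb_len):
    """Return the list of skbs userspace would receive, one per recvmsg()."""
    skbs = []
    pos = 0
    while True:
        skb = []
        while pos < len(items) and len(skb) < per_skb:
            skb.append(items[pos])
            pos += 1
        more = pos < len(items)
        if return_skb_len:
            ret = len(skb)          # old style: non-zero whenever data was added
        else:
            ret = 1 if more else 0  # new style: 0 as soon as the walk is complete
        if ret > 0:
            skbs.append(skb)        # skb is delivered, dump resumes on the next call
            continue
        skb.append("NLMSG_DONE")    # marker shares the current skb
        skbs.append(skb)
        return skbs
```

In this model a dump that fits in one skb takes two recvmsg() calls with the old return value and one with the new.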

Merge branch 'net-sysctl-sentinel' · 5829614a

David S. Miller authored May 03, 2024

Joel Granados says:

====================
sysctl: Remove sentinel elements from networking

What?
These commits remove the sentinel element (last empty element) from the
sysctl arrays of all the files under the "net/" directory that register
a sysctl array. The merging of the preparation patches [4] to mainline
allows us to just remove sentinel elements without changing behavior.
This is safe because the sysctl registration code (register_sysctl() and
friends) uses the array size in addition to checking for a sentinel [1].

Why?
By removing the sysctl sentinel elements we avoid kernel bloat as
ctl_table arrays get moved out of kernel/sysctl.c into their own
respective subsystems. This move was started long ago to avoid merge
conflicts; the sentinel removal bit came after Matthew Wilcox suggested
it to avoid bloating the kernel by one element as arrays moved out. This
patchset will reduce the overall build time size of the kernel and run
time memory bloat by about 64 bytes per declared ctl_table array (more
info here [5]).

When are we done?
There are 4 patchsets (25 commits [2]) that are still outstanding to
completely remove the sentinels: files under "net/" (this patchset),
files under "kernel/" dir, misc dirs (files under mm/ security/ and
others) and the final set that removes the unneeded check for ->procname
== NULL.

Testing:
* Ran sysctl selftests (./tools/testing/selftests/sysctl/sysctl.sh)
* Ran this through 0-day with no errors or warnings

Savings in vmlinux:
  A total of 64 bytes per sentinel is saved after removal; I measured in
  x86_64 to give an idea of the aggregated savings. The actual savings
  will depend on individual kernel configuration.
    * bloat-o-meter
        - The "yesall" config saves 3976 bytes (bloat-o-meter output [6])
        - A reduced config [3] saves 1263 bytes (bloat-o-meter output [7])

Savings in allocated memory:
  None in this set but will occur when the superfluous allocations are
  removed from proc_sysctl.c. I include it here for context. The
  estimated savings during boot for config [3] are 6272 bytes. See [8]
  for how to measure it.

Comments/feedback greatly appreciated

Changes in v6:
- Rebased onto net-next/main.
- Besides re-running my cocci scripts, I ran a new find script [9].
  Found 0 hits in net/
- Moved "i" variable declaration out of for() in sysctl_core_net_init
- Removed forgotten sentinel in mpls_table
- Removed CONFIG_AX25_DAMA_SLAVE guard from net/ax25/ax25_ds_timer.c. It
  is not needed because that file is compiled only when
  CONFIG_AX25_DAMA_SLAVE is set.
- When traversing smc_table, stop on ARRAY_SIZE instead of ARRAY_SIZE-1.
- Link to v5: https://lore.kernel.org/r/20240426-jag-sysctl_remset_net-v5-0-e3b12f6111a6@samsung.com

Changes in v5:
- Added net files with additional variable to my test .config so the
  typo can be caught next time.
- Fixed typo tabel_size -> table_size
- Link to v4: https://lore.kernel.org/r/20240425-jag-sysctl_remset_net-v4-0-9e82f985777d@samsung.com

Changes in v4:
- Keep reverse xmas tree order when introducing new variables
- Use a table_size variable to keep the value of ARRAY_SIZE
- Separated the original "networking: Remove the now superfluous
  sentinel elements from ctl_table arra" into smaller commits to ease
  review
- Merged x.25 and ax.25 commits together.
- Removed any SOB from the commits that were changed
- Link to v3: https://lore.kernel.org/r/20240412-jag-sysctl_remset_net-v3-0-11187d13c211@samsung.com

Changes in v3:
- Reworked ax.25
  - Added a BUILD_BUG_ON for the ax.25 commit
  - Added a CONFIG_AX25_DAMA_SLAVE guard where needed
- Link to v2: https://lore.kernel.org/r/20240328-jag-sysctl_remset_net-v2-0-52c9fad9a1af@samsung.com

Changes in v2:
- Rebased to v6.9-rc1
- Removed unneeded comment from sysctl_net_ax25.c
- Link to v1: https://lore.kernel.org/r/20240314-jag-sysctl_remset_net-v1-0-aa26b44d29d9@samsung.com
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
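
The difference between the two traversal styles mentioned in this series can be sketched as below. This is a Python analogy, not kernel code: the dict entries stand in for struct ctl_table, and names_by_sentinel and names_by_size are hypothetical helpers.

```python
def names_by_sentinel(table):
    """Old style: stop at the empty last element (procname is None)."""
    names = []
    for entry in table:
        if entry["procname"] is None:
            break
        names.append(entry["procname"])
    return names


def names_by_size(table, table_size):
    """New style: the caller passes the array size, no sentinel needed."""
    return [table[i]["procname"] for i in range(table_size)]


with_sentinel = [{"procname": "a"}, {"procname": "b"}, {"procname": None}]
without_sentinel = [{"procname": "a"}, {"procname": "b"}]
```

Both walks visit the same entries, but the size-based one lets the table drop its trailing empty element, which is where the per-array saving comes from.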

ax.25: x.25: Remove the now superfluous sentinel elements from ctl_table array · 78a7b5db

Joel Granados authored May 01, 2024

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Avoid a buffer overflow when traversing the ctl_table by ensuring that
AX25_MAX_VALUES is the same as the size of ax25_param_table. This is
done with a BUILD_BUG_ON where ax25_param_table is defined and a
CONFIG_AX25_DAMA_SLAVE guard in the unnamed enum definition as well as
in the ax25_dev_device_up and ax25_ds_set_timer functions.

The overflow happened when the sentinel was removed from
ax25_param_table. The sentinel's data element was changed when
CONFIG_AX25_DAMA_SLAVE was undefined. This had no adverse effects as it
still stopped on the sentinel's null procname but needed to be addressed
once the sentinel was removed.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

appletalk: Remove the now superfluous sentinel elements from ctl_table array · e00e35e2

Joel Granados authored May 01, 2024

Remove sentinel from atalk_table ctl_table array.

Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: Remove the now superfluous sentinel elements from ctl_table array · 635470eb

Joel Granados authored May 01, 2024

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel elements from ctl_table structs
* Remove instances where an array element is zeroed out to make it look
  like a sentinel. This is no longer needed and is safe after commit
  c899710f ("networking: Update to register_net_sysctl_sz") added
  the array size to the ctl_table registration
* Remove the need for having __NF_SYSCTL_CT_LAST_SYSCTL as the
  sysctl array size is now in NF_SYSCTL_CT_LAST_SYSCTL
* Remove extra element in ctl_table arrays declarations

Acked-by: Kees Cook <keescook@chromium.org> # loadpin & yama
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Remove ctl_table sentinel elements from several networking subsystems · 73dbd8cf

Joel Granados authored May 01, 2024

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

To avoid lots of small commits, this commit brings together network
changes from (as they appear in MAINTAINERS) LLC, MPTCP, NETROM NETWORK
LAYER, PHONET PROTOCOL, ROSE NETWORK LAYER, RXRPC SOCKETS, SCTP
PROTOCOL, SHARED MEMORY COMMUNICATIONS (SMC), TIPC NETWORK LAYER and
NETWORKING [IPSEC]

* Remove sentinel element from ctl_table structs.
* Replace empty array registration with the register_net_sysctl_sz call
  in llc_sysctl_init
* Replace the for loop stop condition that tests for procname == NULL
  with one that depends on array size in sctp_sysctl_net_register
* Remove instances where an array element is zeroed out to make it look
  like a sentinel in xfrm_sysctl_init. This is no longer needed and is
  safe after commit c899710f ("networking: Update to
  register_net_sysctl_sz") added the array size to the ctl_table
  registration
* Use a table_size variable to keep the value of ARRAY_SIZE
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sunrpc: Remove the now superfluous sentinel elements from ctl_table array · ca5d1fce

Joel Granados authored May 01, 2024

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: rds: Remove the now superfluous sentinel elements from ctl_table array · 92bedf07

Joel Granados authored May 01, 2024

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

* Remove sentinel element from ctl_table structs.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Acked-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv{6,4}: Remove the now superfluous sentinel elements from ctl_table array · 1c106eb0

Joel Granados authored May 01, 2024

This commit comes at the tail end of a larger effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which
will reduce the overall build-time size of the kernel and run-time
memory bloat by ~64 bytes per sentinel. Further information:
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/

* Remove sentinel element from ctl_table structs.
* Remove the zeroing out of an array element (to make it look like a
  sentinel) in sysctl_route_net_init and ipv6_route_sysctl_init.
  This is no longer needed and is safe after commit c899710f
  ("networking: Update to register_net_sysctl_sz") added the array size
  to the ctl_table registration.
* Remove extra sentinel element in the declaration of devinet_vars.
* Remove the "-1" in __devinet_sysctl_register, sysctl_route_net_init,
  ipv6_sysctl_net_init and ipv4_sysctl_init_net that adjusted for having
  an extra empty element when looping over ctl_table arrays.
* Replace the for loop stop condition in __addrconf_sysctl_register that
  tests for procname == NULL with one that depends on array size.
* Removing the unprivileged user check in ipv6_route_sysctl_init is
  safe as it is replaced by calling ipv6_route_sysctl_table_size,
  introduced in commit c899710f ("networking: Update to
  register_net_sysctl_sz").
* Use a table_size variable to keep the value of ARRAY_SIZE.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
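
The loop changes listed above (dropping the "-1" and replacing the procname == NULL stop condition) can be sketched in user space as follows. The struct, the table contents and the helper names are hypothetical and chosen only for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry; not the kernel's struct ctl_table. */
struct demo_entry {
	const char *procname;
	int value;
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Old layout: the last, empty element marks the end of the array. */
static const struct demo_entry demo_with_sentinel[] = {
	{ "forwarding", 1 },
	{ "mtu", 1500 },
	{ NULL, 0 },
};

/* New layout: no sentinel, the caller must know the size. */
static const struct demo_entry demo_without_sentinel[] = {
	{ "forwarding", 1 },
	{ "mtu", 1500 },
};

/* Old style: walk until procname == NULL. */
static int demo_sum_until_sentinel(const struct demo_entry *t)
{
	int sum = 0;

	assert(t != NULL);
	for (; t->procname; t++)
		sum += t->value;
	return sum;
}

/* New style: the stop condition depends on the array size. */
static int demo_sum_by_size(const struct demo_entry *t, size_t table_size)
{
	size_t i;
	int sum = 0;

	assert(t != NULL);
	for (i = 0; i < table_size; i++)
		sum += t[i].value;
	return sum;
}

/* Old style needed "- 1" to skip the sentinel; new style does not. */
static size_t demo_old_count(void)
{
	return DEMO_ARRAY_SIZE(demo_with_sentinel) - 1;
}

static size_t demo_new_count(void)
{
	return DEMO_ARRAY_SIZE(demo_without_sentinel);
}
```

Both loops visit the same two entries; the size-based one simply no longer relies on an extra array element to know where to stop.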

1c106eb0

net: Remove the now superfluous sentinel elements from ctl_table array · ce218712

Joel Granados authored May 01, 2024

This commit comes at the tail end of a larger effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which
will reduce the overall build-time size of the kernel and run-time
memory bloat by ~64 bytes per sentinel. Further information:
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/

* Remove sentinel element from ctl_table structs.
* Remove the zeroing out of an array element (to make it look like a
  sentinel) in neigh_sysctl_register and lowpan_frags_ns_sysctl_register.
  This is no longer needed and is safe after commit c899710f
  ("networking: Update to register_net_sysctl_sz") added the array size
  to the ctl_table registration.
* Replace the for loop stop condition in sysctl_core_net_init that tests
  for procname == NULL with one that depends on array size.
* Remove the "-1" in mpls_net_init that adjusted for having an extra
  empty element when looping over ctl_table arrays.
* Use a table_size variable to keep the value of ARRAY_SIZE.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ce218712

net_sched: sch_sfq: annotate data-races around q->perturb_period · a17ef9e6

Eric Dumazet authored Apr 30, 2024

sfq_perturbation() reads q->perturb_period locklessly.
Add annotations to fix potential issues.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240430180015.3111398-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a17ef9e6

net: dsa: mv88e6xxx: Correct check for empty list · 4c7f3950

Simon Horman authored Apr 30, 2024

Since commit a3c53be5 ("net: dsa: mv88e6xxx: Support multiple MDIO
busses") mv88e6xxx_default_mdio_bus() has checked that the
return value of list_first_entry() is non-NULL.

This appears to be intended to guard against the list chip->mdios being
empty.  However, it is not the correct check as the implementation of
list_first_entry is not designed to return NULL for empty lists.

Instead, use list_first_entry_or_null() which does return NULL if the
list is empty.

Flagged by Smatch.
Compile tested only.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240430-mv88e6xx-list_empty-v3-1-c35c69d88d2e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
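
The difference between the two helpers can be reproduced with a small user-space sketch. The `demo_*` macros below are simplified re-implementations written for this example (the kernel versions differ in detail), and `struct demo_mdio_bus` is hypothetical. The list member is placed at offset 0 so that the pointer computed for an empty list stays trivial; it is never dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

#define DEMO_LIST_HEAD_INIT(name) { &(name), &(name) }

#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Always computes a pointer from head->next, even if the list is empty. */
#define demo_list_first_entry(head, type, member) \
	demo_container_of((head)->next, type, member)

/* Checks for an empty list first and returns NULL in that case. */
#define demo_list_first_entry_or_null(head, type, member) \
	((head)->next != (head) ? \
	 demo_list_first_entry(head, type, member) : NULL)

/* Hypothetical list element. */
struct demo_mdio_bus {
	struct demo_list_head list;
	int id;
};

static void demo_list_add(struct demo_list_head *entry,
			  struct demo_list_head *head)
{
	assert(entry != NULL && head != NULL);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* On an empty list this returns a bogus non-NULL pointer. */
static struct demo_mdio_bus *demo_first_plain(struct demo_list_head *head)
{
	return demo_list_first_entry(head, struct demo_mdio_bus, list);
}

/* On an empty list this returns NULL, so a NULL check is meaningful. */
static struct demo_mdio_bus *demo_first_or_null(struct demo_list_head *head)
{
	return demo_list_first_entry_or_null(head, struct demo_mdio_bus, list);
}
```

On an empty list, `head->next` points back at the head itself, so the plain variant yields a pointer derived from the list head rather than NULL, and a subsequent NULL check never fires.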

4c7f3950

selftests/net: skip partial checksum packets in csum test · ec6f25bc

Willem de Bruijn authored May 01, 2024

Detect packets with ip_summed CHECKSUM_PARTIAL and skip them. These
should not exist, as the test sends individual packets between two
hosts. But if (HW) GRO is on, subsequent packets with randomized
content can sometimes be coalesced.

In this case the GSO packet checksum is converted to a pseudo checksum
in anticipation of sending out as TSO/USO. So the field will not match
the expected value.

Do not count these as test errors.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240501193156.3627344-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
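
To see why a pseudo checksum in the field does not match the expected value, consider this self-contained sketch. The addresses, ports and payload are made up, the helper names are hypothetical, and the exact value the kernel stores for CHECKSUM_PARTIAL may differ in representation; the sketch only assumes that it covers the pseudo header and not the payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of big-endian 16-bit words, folded to 16 bits. */
static uint16_t demo_csum_add(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	assert(buf != NULL);
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Made-up IPv4/UDP pseudo header. */
static const uint8_t demo_pseudo_hdr[12] = {
	192, 0, 2, 1,	/* source address */
	192, 0, 2, 2,	/* destination address */
	0, 17,		/* zero, protocol = UDP */
	0, 12,		/* UDP length: 8 byte header + 4 byte payload */
};

/* Made-up UDP header and payload, checksum field zeroed. */
static const uint8_t demo_udp[12] = {
	0x30, 0x39, 0x00, 0x35,	/* source port, destination port */
	0x00, 0x0c, 0x00, 0x00,	/* length, checksum */
	'd', 'a', 't', 'a',
};

/* Full checksum: complement of the sum over pseudo header and datagram. */
static uint16_t demo_full_csum(void)
{
	uint16_t sum;

	sum = demo_csum_add(demo_pseudo_hdr, sizeof(demo_pseudo_hdr), 0);
	sum = demo_csum_add(demo_udp, sizeof(demo_udp), sum);
	return (uint16_t)~sum;
}

/* Pseudo checksum: only the pseudo header is summed; the rest is left
 * to be filled in later, for example by the device. */
static uint16_t demo_pseudo_csum(void)
{
	return demo_csum_add(demo_pseudo_hdr, sizeof(demo_pseudo_hdr), 0);
}

/* What a receiver-side test effectively does with the checksum field. */
static int demo_field_matches(uint16_t field)
{
	return field == demo_full_csum();
}
```

A field holding the pseudo checksum fails `demo_field_matches()` even though nothing is wrong with the packet, which is why such packets are skipped rather than counted as errors.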

ec6f25bc