Commits · 6a2d98b18900002b6d24c4c3850c1c2467d13898 · Kirill Smelkov / linux

29 Jul, 2021 35 commits

mctp: Add MCTP overview document · 6a2d98b1

Jeremy Kerr authored Jul 29, 2021

This change adds a brief document about the sockets API provided for
sending and receiving MCTP messages from userspace.

This is roughly based on the OpenBMC design document, at:

https://github.com/openbmc/docs/blob/master/designs/mctp/mctp-kernel.md
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a2d98b1

mctp: Allow per-netns default networks · 03f2bbc4

Matt Johnston authored Jul 29, 2021

Currently we have a compile-time default network
(MCTP_INITIAL_DEFAULT_NET). This change introduces a default_net field
on the net namespace, allowing future configuration for new interfaces.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

03f2bbc4

mctp: Add dest neighbour lladdr to route output · 26ab3fca

Matt Johnston authored Jul 29, 2021

Now that we have a neighbour implementation, hook it up to the output
path to set the dest hardware address for outgoing packets.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

26ab3fca

mctp: Implement message fragmentation & reassembly · 4a992bbd

Jeremy Kerr authored Jul 29, 2021

This change implements MCTP fragmentation (based on route & device MTU),
and corresponding reassembly.

The MCTP specification only allows for fragmentation on the originating
message endpoint, and reassembly on the destination endpoint -
intermediate nodes do not need to reassemble/refragment. Consequently,
we only fragment in the local transmit path, and reassemble
locally-bound packets. Messages are required to be in-order, so we
simply cancel reassembly on out-of-order or missing packets.

In the fragmentation path, we just break up the message into MTU-sized
fragments; the skb structure is a simple copy for now, which we can later
improve with a shared data implementation.

For reassembly, we keep track of incoming message fragments using the
existing tag infrastructure, allocating a key on the (src,dest,tag)
tuple, and reassemble matching fragments into a skb->frag_list.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a992bbd
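The fragmentation and in-order reassembly rules described in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: the names `frag_count`, `frag_flags`, `struct reasm` and `reasm_input` are hypothetical, the flag bit positions follow the DSP0236 transport header (SOM, EOM, 2-bit packet sequence number), and only payload lengths are tracked rather than real skbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits of the last MCTP header byte, per DSP0236:
 * bit 7 = SOM, bit 6 = EOM, bits 5..4 = packet sequence number. */
#define FRAG_SOM       0x80u
#define FRAG_EOM       0x40u
#define FRAG_SEQ_SHIFT 4
#define FRAG_SEQ_MASK  0x3u

/* Packets needed for msg_len bytes when each packet carries at most
 * payload_max bytes (MTU minus header size). Returns 0 on bad input. */
static size_t frag_count(size_t msg_len, size_t payload_max)
{
    if (payload_max == 0)
        return 0;
    return (msg_len + payload_max - 1) / payload_max;
}

/* Flag/sequence bits for fragment idx out of total fragments. */
static uint8_t frag_flags(size_t idx, size_t total)
{
    uint8_t v;

    assert(idx < total);
    v = (uint8_t)((idx & FRAG_SEQ_MASK) << FRAG_SEQ_SHIFT);
    if (idx == 0)
        v |= FRAG_SOM;          /* first fragment starts the message */
    if (idx + 1 == total)
        v |= FRAG_EOM;          /* last fragment ends the message */
    return v;
}

/* Hypothetical receiver state for one (src, dest, tag) key. */
struct reasm {
    int active;                 /* SOM seen, EOM not yet seen */
    uint8_t next_seq;           /* sequence number expected next */
    size_t len;                 /* payload bytes accepted so far */
};

enum reasm_result { REASM_MORE, REASM_DONE, REASM_DROPPED };

/* Feed one packet; reassembly is cancelled on any out-of-order packet. */
static enum reasm_result reasm_input(struct reasm *r, uint8_t flags,
                                     size_t payload_len)
{
    uint8_t seq = (flags >> FRAG_SEQ_SHIFT) & FRAG_SEQ_MASK;

    if (flags & FRAG_SOM) {     /* a new message restarts the state */
        r->active = 1;
        r->len = 0;
        r->next_seq = seq;
    }
    if (!r->active || seq != r->next_seq) {
        r->active = 0;          /* missing or reordered fragment */
        return REASM_DROPPED;
    }
    r->len += payload_len;
    r->next_seq = (uint8_t)((seq + 1) & FRAG_SEQ_MASK);
    if (flags & FRAG_EOM) {
        r->active = 0;
        return REASM_DONE;
    }
    return REASM_MORE;
}
```

With a 64-byte payload limit, a 150-byte message would be sent as three fragments; skipping the middle one makes the receiver drop the message instead of trying to recover it.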

mctp: Populate socket implementation · 833ef3b9

Jeremy Kerr authored Jul 29, 2021

Start filling-out the socket syscalls: bind, sendmsg & recvmsg.

This requires an input route implementation, so we add to
mctp_route_input, allowing lookups on binds & message tags. This just
handles single-packet messages at present; we will add fragmentation in
a future change.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

833ef3b9

mctp: Add neighbour netlink interface · 831119f8

Matt Johnston authored Jul 29, 2021

This change adds the netlink interfaces for manipulating the MCTP
neighbour table.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

831119f8

mctp: Add neighbour implementation · 4d8b9319

Matt Johnston authored Jul 29, 2021

Add an initial neighbour table implementation, to be used in the route
output path.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d8b9319

mctp: Add netlink route management · 06d2f4c5

Matt Johnston authored Jul 29, 2021

This change adds RTM_GETROUTE, RTM_NEWROUTE & RTM_DELROUTE handlers,
allowing management of the MCTP route table.

Includes changes from Jeremy Kerr <jk@codeconstruct.com.au>.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

06d2f4c5

mctp: Add initial routing framework · 889b7da2

Jeremy Kerr authored Jul 29, 2021

Add a simple routing table, and a couple of route output handlers, and
the mctp packet_type & handler.

Includes changes from Matt Johnston <matt@codeconstruct.com.au>.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

889b7da2

mctp: Add device handling and netlink interface · 583be982

Jeremy Kerr authored Jul 29, 2021

This change adds the infrastructure for managing MCTP netdevices; we add
a pointer to the AF_MCTP-specific data to struct net_device, and hook up
the rtnetlink operations for adding and removing addresses.

Includes changes from Matt Johnston <matt@codeconstruct.com.au>.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

583be982

mctp: Add initial driver infrastructure · 4b2e6930

Jeremy Kerr authored Jul 29, 2021

Add an empty drivers/net/mctp/, for future interface drivers.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b2e6930

mctp: Add sockaddr_mctp to uapi · 60fc6398

Jeremy Kerr authored Jul 29, 2021

This change introduces the user-visible MCTP header, containing the
protocol-specific addressing definitions.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

60fc6398

mctp: Add base packet definitions · 2c8e2e9a

Jeremy Kerr authored Jul 29, 2021

Simple packet header format as defined by DMTF DSP0236.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c8e2e9a

mctp: Add base socket/protocol definitions · 8f601a1e

Jeremy Kerr authored Jul 29, 2021

Add an empty socket implementation, plus initialisation/destruction
handlers.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f601a1e

mctp: Add MCTP base · bc49d816

Jeremy Kerr authored Jul 29, 2021

Add basic Kconfig, an initial (empty) af_mctp source object, and
{AF,PF}_MCTP definitions, and the required definitions for a new
protocol type.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

bc49d816

Merge branch 'nfc-const' · 658e6b16

David S. Miller authored Jul 29, 2021

Krzysztof Kozlowski says:

====================
nfc: constify, continued (part 2)

On top of:
nfc: constify pointed data
https://lore.kernel.org/lkml/20210726145224.146006-1-krzysztof.kozlowski@canonical.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

658e6b16

nfc: mrvl: constify static nfcmrvl_if_ops · 26955037

Krzysztof Kozlowski authored Jul 29, 2021

File-scope struct nfcmrvl_if_ops is not modified so can be made const.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26955037
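As an illustration of what this constify series does, the sketch below shows a file-scope ops table that is never modified and a helper that only reads through its pointer argument. All names (`demo_dev`, `demo_if_ops`, `demo_ops`) are hypothetical and do not come from the nfcmrvl driver.

```c
/* Hypothetical device state, standing in for a driver's private data. */
struct demo_dev {
    int powered;
};

/* Interface operations; the table itself is never written after init. */
struct demo_if_ops {
    int (*open)(struct demo_dev *dev);
    int (*is_open)(const struct demo_dev *dev);
};

/* Modifies the device, so the parameter stays non-const. */
static int demo_open(struct demo_dev *dev)
{
    dev->powered = 1;
    return 0;
}

/* Only reads the device, so the pointed-to data can be const. */
static int demo_is_open(const struct demo_dev *dev)
{
    return dev->powered;
}

/* File-scope and never modified: const lets the compiler place it in
 * read-only data and reject accidental writes. */
static const struct demo_if_ops demo_ops = {
    .open = demo_open,
    .is_open = demo_is_open,
};
```

Any later attempt to assign to `demo_ops.open` or to write through the `const struct demo_dev *` argument fails at compile time, which is the correctness benefit the commit messages refer to.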

nfc: mrvl: constify several pointers · fe53159f

Krzysztof Kozlowski authored Jul 29, 2021

Several functions do not modify pointed data so arguments and local
variables can be const for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe53159f

nfc: microread: constify several pointers · a751449f

Krzysztof Kozlowski authored Jul 29, 2021

Several functions do not modify pointed data so arguments and local
variables can be const for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a751449f

nfc: fdp: constify several pointers · 3d463dd5

Krzysztof Kozlowski authored Jul 29, 2021

Several functions do not modify pointed data so arguments and local
variables can be const for correctness and safety. This also allows
making the file-scope nci_core_get_config_otp_ram_version array const.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3d463dd5

nfc: fdp: use unsigned int as loop iterator · c3e26b6d

Krzysztof Kozlowski authored Jul 29, 2021

Loop iterators are simple integers, so there is no point in optimizing
their size by using u8.  It only raises the question of whether the
variable is used in some other context.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3e26b6d

nfc: fdp: drop unneeded cast for printing firmware size in dev_dbg() · 6c755b1d

Krzysztof Kozlowski authored Jul 29, 2021

The firmware size is of type size_t, so print it directly instead of
casting to int.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c755b1d
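The same idea can be shown in userspace, where `snprintf()` stands in for `dev_dbg()` (an assumption made so the sketch is runnable outside the kernel). The helper name `format_fw_size` is hypothetical; the point is the `%zu` conversion, which takes a `size_t` directly with no cast.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a firmware size into buf. %zu is the printf conversion for
 * size_t, so no narrowing cast to int is needed. Returns the number of
 * characters that were (or would have been) written, as snprintf does. */
static int format_fw_size(char *buf, size_t buflen, size_t fw_size)
{
    return snprintf(buf, buflen, "firmware size: %zu", fw_size);
}
```

Casting to `int` and printing with `%d` would silently truncate sizes that do not fit in an `int`, which `%zu` avoids.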

nfc: nfcsim: constify drvdata (struct nfcsim) · 582fdc98

Krzysztof Kozlowski authored Jul 29, 2021

nfcsim_abort_cmd() does not modify struct nfcsim, so local variable
can be a pointer to const.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

582fdc98

nfc: virtual_ncidev: constify pointer to nfc_dev · 83428dbb

Krzysztof Kozlowski authored Jul 29, 2021

virtual_ncidev_ioctl() does not modify struct nfc_dev, so local variable
can be a pointer to const.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

83428dbb

nfc: trf7970a: constify several pointers · ea050c5e

Krzysztof Kozlowski authored Jul 29, 2021

Several functions do not modify pointed data so arguments and local
variables can be const for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ea050c5e

nfc: port100: constify several pointers · 9a4af01c

Krzysztof Kozlowski authored Jul 29, 2021

Several functions do not modify pointed data so arguments and local
variables can be const for correctness and safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9a4af01c

nfc: mei_phy: constify buffer passed to mei_nfc_send() · 894a6e15

Krzysztof Kozlowski authored Jul 29, 2021

The buffer passed to mei_nfc_send() can be const for correctness and
safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

894a6e15

nfc: constify passed nfc_dev · dd8987a3

Krzysztof Kozlowski authored Jul 29, 2021

The struct nfc_dev is not modified by nfc_get_drvdata() and
nfc_device_name() so it can be made const.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dd8987a3

Merge branch 'skb-gro-optimize' · 8cb79af5

David S. Miller authored Jul 29, 2021

Paolo Abeni says:

====================
sk_buff: optimize GRO for the common case

This is a trimmed down revision of "sk_buff: optimize layout for GRO",
specifically dropping the changes to the sk_buff layout[1].

This series tries to accomplish 2 goals:
- optimize the GRO stage for the most common scenario, avoiding a bunch
  of conditionals and some more code
- let owned skbs enter the GRO engine, allowing backpressure in the
  veth GRO forward path.

A new sk_buff flag (!!!) is introduced and maintained for GRO's sake.
This field uses an existing hole, so there is no change to the sk_buff
size.

[1] two main reasons:
- moving the skb->inner_ fields requires some extra care, as some in-kernel
  users access the fields regardless of skb->encapsulation.
- extending the secmark size clashes with the ct and nft uAPIs

Addressing all of the above is possible, I think, but certainly not in a
single series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8cb79af5

veth: use skb_prepare_for_gro() · d504fff0

Paolo Abeni authored Jul 28, 2021

Leveraging the previous patch we can now avoid orphaning the
skb in the veth gro path, allowing correct backpressure.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d504fff0

skbuff: allow 'slow_gro' for skb carring sock reference · 5e10da53

Paolo Abeni authored Jul 28, 2021

This change leverages the infrastructure introduced by the previous
patches to allow soft devices to pass owned skbs to the GRO engine
without impacting the fast-path.

It's up to the GRO caller to ensure the slow_gro bit is valid before
invoking the GRO engine. The new helper skb_prepare_for_gro() is
introduced for that goal.

On slow_gro, skbs are aggregated only with equal sk.
Additionally, skb truesize on GRO recycle and free is correctly
updated so that sk wmem is not changed by the GRO processing.

rfc-> v1:
 - fixed bad truesize on dev_gro_receive NAPI_FREE
 - use the existing state bit
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e10da53

net: optimize GRO for the common case. · 9efb4b5b

Paolo Abeni authored Jul 28, 2021

After the previous patches, at GRO time, skb->slow_gro is
usually 0, unless the packet comes from some H/W offload
slowpath or tunnel.

We can optimize the GRO code assuming !skb->slow_gro is likely.
This removes multiple conditionals in the most common path, at the
price of an additional one when we hit the above "slow-paths".
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9efb4b5b
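The trade-off described here (one test in the common case, one extra test in the slow case) can be modelled in a few lines. This is a simplified userspace sketch, not the sk_buff code: `struct pkt`, the `pkt_set_*` helpers and `gro_prepare` are hypothetical names, and the three pointers merely stand in for the ct, dst and extension state mentioned in this series.

```c
#include <stddef.h>

/* Simplified stand-in for an skb: three optional state pointers plus a
 * single summary bit that is set whenever any of them is attached. */
struct pkt {
    void *ct;                   /* conntrack entry */
    void *dst;                  /* routing destination */
    void *ext;                  /* extensions */
    unsigned int slow_gro : 1;  /* "some state may be attached" */
};

static void pkt_set_ct(struct pkt *p, void *ct)
{
    p->ct = ct;
    p->slow_gro |= (ct != NULL);
}

static void pkt_set_dst(struct pkt *p, void *dst)
{
    p->dst = dst;
    p->slow_gro |= (dst != NULL);
}

static void pkt_set_ext(struct pkt *p, void *ext)
{
    p->ext = ext;
    p->slow_gro |= (ext != NULL);
}

/* Prepare a packet for aggregation and return how many state fields had
 * to be released. The common case costs exactly one test; only flagged
 * packets pay for the individual checks. The bit is left set here,
 * following the "never clear it" note: a stale bit only costs a trip
 * through the slow path. */
static int gro_prepare(struct pkt *p)
{
    int released = 0;

    if (!p->slow_gro)           /* fast path, expected to be likely */
        return 0;

    if (p->ct)  { p->ct = NULL;  released++; }
    if (p->dst) { p->dst = NULL; released++; }
    if (p->ext) { p->ext = NULL; released++; }
    return released;
}
```

In the kernel the fast-path test would typically be annotated with `likely()`/`unlikely()`; that is omitted here to keep the sketch portable.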

sk_buff: track extension status in slow_gro · b0999f38

Paolo Abeni authored Jul 28, 2021

Similar to the previous one, but tracking the
active_extensions field status.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b0999f38

sk_buff: track dst status in slow_gro · 8a886b14

Paolo Abeni authored Jul 28, 2021

Similar to the previous patch, but covering the dst field:
the slow_gro flag is additionally set when a dst is attached
to the skb.

RFC -> v1:
 - use the existing flag instead of adding a new one
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8a886b14

sk_buff: introduce 'slow_gro' flags · 5fc88f93

Paolo Abeni authored Jul 28, 2021

The new flag tracks if any state field is set, so that
GRO requires 'unusual'/slow prepare steps.

Set this flag when a ct entry is attached to the skb,
and never clear it.

The new bit uses an existing hole in the sk_buff struct.

RFC -> v1:
 - use a single state bit, never clear it
 - avoid moving the _nfct field
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5fc88f93

28 Jul, 2021 5 commits

Documentation: networking: add ioam6-sysctl into index · 883d71a5

Hu Haowen authored Jul 28, 2021

Append ioam6-sysctl to the toctree in order to get rid of build warnings.
Signed-off-by: Hu Haowen <src.res@email.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

883d71a5

net: dsa: sja1105: be stateless when installing FDB entries · b11f0a4c

Vladimir Oltean authored Jul 28, 2021

Currently there are issues when adding a bridge FDB entry as VLAN-aware
and deleting it as VLAN-unaware, or vice versa.

However this is an unneeded complication, since the bridge always
installs its default FDB entries in VLAN 0 to match on VLAN-unaware
ports, and in the default_pvid (VLAN 1) to match on VLAN-aware ports.
So instead of trying to outsmart the bridge, just install all entries it
gives us, and they will start matching packets when the vlan_filtering
mode changes.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b11f0a4c

Merge branch 'switchdev-notifiers' · b0fdb999

David S. Miller authored Jul 28, 2021

Vladimir Oltean says:

====================
Plug the last 2 holes in the switchdev notifiers for local FDB entries

The work for trapping local FDB entries to the CPU in switchdev/DSA
started with the "RX filtering in DSA" series:
https://patchwork.kernel.org/project/netdevbpf/cover/20210629140658.2510288-1-olteanv@gmail.com/
and was continued with further improvements such as "Fan out FDB entries
pointing towards the bridge to all switchdev member ports":
https://patchwork.kernel.org/project/netdevbpf/cover/20210719135140.278938-1-vladimir.oltean@nxp.com/
https://patchwork.kernel.org/project/netdevbpf/cover/20210720173557.999534-1-vladimir.oltean@nxp.com/

There are only 2 more issues left to be addressed (famous last words),
and these are:
- dynamically learned FDB entries towards interfaces foreign to DSA need
  to be replayed too
- adding/deleting a VLAN on a port causes the local FDB entries in that
  VLAN to be prematurely deleted

This patch series addresses both, and patch 2 depends on 1 to work properly.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b0fdb999

net: bridge: switchdev: treat local FDBs the same as entries towards the bridge · 52e4bec1

Vladimir Oltean authored Jul 28, 2021

Currently the following script:

1. ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up
2. ip link set swp2 up && ip link set swp2 master br0
3. ip link set swp3 up && ip link set swp3 master br0
4. ip link set swp4 up && ip link set swp4 master br0
5. bridge vlan del dev swp2 vid 1
6. bridge vlan del dev swp3 vid 1
7. ip link set swp4 nomaster
8. ip link set swp3 nomaster

produces the following output:

[ 641.010738] sja1105 spi0.1: port 2 failed to delete 00:1f:7b:63:02:48 vid 1 from fdb: -2

[ swp2, swp3 and br0 all have the same MAC address, the one listed above ]

In short, this happens because the number of FDB entry additions
notified to switchdev is unbalanced with the number of deletions.

At step 1, the bridge has a random MAC address. At step 2, the
br_fdb_replay of swp2 receives this initial MAC address. Then the bridge
inherits the MAC address of swp2 via br_fdb_change_mac_address(), and it
notifies switchdev (only swp2 at this point) of the deletion of the
random MAC address and the addition of 00:1f:7b:63:02:48 as a local FDB
entry with fdb->dst == swp2, in VLANs 0 and the default_pvid (1).

During step 7:

del_nbp
-> br_fdb_delete_by_port(br, p, vid=0, do_all=1);
-> fdb_delete_local(br, p, f);

br_fdb_delete_by_port() deletes all entries towards the ports,
regardless of vid, because do_all is 1.

fdb_delete_local() has logic to migrate local FDB entries deleted from
one port to another port which shares the same MAC address and is in the
same VLAN, or to the bridge device itself. This migration happens
without notifying switchdev of the deletion on the old port and the
addition on the new one, just fdb->dst is changed and the added_by_user
flag is cleared.

In the example above, the del_nbp(swp4) causes the
"addr 00:1f:7b:63:02:48 vid 1" local FDB entry with fdb->dst == swp4
that existed up until then to be migrated directly towards the bridge
(fdb->dst == NULL). This is because it cannot be migrated to any of the
other ports (swp2 and swp3 are not in VLAN 1).

After the migration to br0 takes place, swp4 requests a deletion replay
of all FDB entries. Since the "addr 00:1f:7b:63:02:48 vid 1" entry now
points towards the bridge, a deletion of it is replayed. There was just
a prior addition of this address, so the switchdev driver deletes this
entry.

Then, the del_nbp(swp3) at step 8 triggers another br_fdb_replay, and
switchdev is notified again to delete "addr 00:1f:7b:63:02:48 vid 1".
But it can't because it no longer has it, so it returns -ENOENT.

There are other possibilities to trigger this issue, but this is by far
the simplest to explain.

To fix this, we must avoid the situation where the addition of an FDB
entry is notified to switchdev as a local entry on a port, and the
deletion is notified on the bridge itself.

Considering that the 2 types of FDB entries are completely equivalent
and we cannot have the same MAC address as a local entry on 2 bridge
ports, or on a bridge port and pointing towards the bridge at the same
time, it makes sense to hide away from switchdev completely the fact
that a local FDB entry is associated with a given bridge port at all.
Just say that it points towards the bridge; it should make no difference
whatsoever to the switchdev driver and should even lead to a simpler
overall implementation, with fewer cases to handle.

This also avoids any modification at all to the core bridge driver, just
what is reported to switchdev changes. With the local/permanent entries
on bridge ports being already reported to user space, it is hard to
believe that the bridge behavior can change in any backwards-incompatible
way such as making all local FDB entries point towards the bridge.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

52e4bec1
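The imbalance described in this commit message (one addition notified, two deletions replayed) can be reproduced with a toy hardware FDB. This is a minimal sketch under stated assumptions: `struct hw_fdb`, `hw_fdb_add` and `hw_fdb_del` are hypothetical and far simpler than the sja1105 driver, and the table treats a repeated add as a no-op.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HW_FDB_MAX 8

/* One (MAC, VID) entry in a hypothetical switch FDB. */
struct hw_fdb_entry {
    uint8_t mac[6];
    uint16_t vid;
    int used;
};

struct hw_fdb {
    struct hw_fdb_entry e[HW_FDB_MAX];
};

static struct hw_fdb_entry *hw_fdb_find(struct hw_fdb *t,
                                        const uint8_t mac[6], uint16_t vid)
{
    for (int i = 0; i < HW_FDB_MAX; i++)
        if (t->e[i].used && t->e[i].vid == vid &&
            memcmp(t->e[i].mac, mac, 6) == 0)
            return &t->e[i];
    return NULL;
}

/* Install an entry; adding an existing entry is treated as a no-op. */
static int hw_fdb_add(struct hw_fdb *t, const uint8_t mac[6], uint16_t vid)
{
    if (hw_fdb_find(t, mac, vid))
        return 0;
    for (int i = 0; i < HW_FDB_MAX; i++) {
        if (!t->e[i].used) {
            memcpy(t->e[i].mac, mac, 6);
            t->e[i].vid = vid;
            t->e[i].used = 1;
            return 0;
        }
    }
    return -ENOSPC;
}

/* Remove an entry; a deletion with no matching addition fails. */
static int hw_fdb_del(struct hw_fdb *t, const uint8_t mac[6], uint16_t vid)
{
    struct hw_fdb_entry *e = hw_fdb_find(t, mac, vid);

    if (!e)
        return -ENOENT;
    e->used = 0;
    return 0;
}
```

Replaying the sequence from the message (one add for 00:1f:7b:63:02:48 vid 1, a delete when swp4 leaves, another delete when swp3 leaves) makes the second delete return `-ENOENT`, which is the `-2` in the quoted log line.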

net: bridge: switchdev: replay the entire FDB for each port · b4454bc6

Vladimir Oltean authored Jul 28, 2021

Currently when a switchdev port joins a bridge, we replay all FDB
entries pointing towards that port or towards the bridge.

However, this is insufficient in certain situations:

(a) DSA, through its assisted_learning_on_cpu_port logic, snoops
    dynamically learned FDB entries on foreign interfaces.
    These are FDB entries that are pointing neither towards the newly
    joined switchdev port, nor towards the bridge. So these addresses
    would be missed when joining a bridge where a foreign interface has
    already learned some addresses, and they would also linger on if the
    DSA port leaves the bridge before the foreign interface forgets them.
    None of this happens if we replay the entire FDB when the port joins.

(b) There is a desire to treat local FDB entries on a port (i.e. the
    port's termination MAC address) identically to FDB entries pointing
    towards the bridge itself. More details on the reason behind this in
    the next patch. The point is that this cannot be done given the
    current structure of br_fdb_replay() in this situation:
      ip link set swp0 master br0  # br0 inherits its MAC address from swp0
      ip link set swp1 master br0
    What is desirable is that when swp1 joins the bridge, br_fdb_replay()
    also notifies swp1 of br0's MAC address, but this won't in fact
    happen because the MAC address of br0 does not have fdb->dst == NULL
    (it doesn't point towards the bridge), but it has fdb->dst == swp0.
    So our current logic makes it impossible for that address to be
    replayed. But if we dump the entire FDB instead of just the entries
    with fdb->dst == swp1 and fdb->dst == NULL, then the inherited MAC
    address of br0 will be replayed too, which is what we need.
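The difference between the two replay strategies in (a) and (b) can be sketched as a filter over a small FDB array. This is an illustrative model only: `struct port`, `struct fdb` and `fdb_replay_count` are hypothetical names, and a NULL `dst` stands for "points towards the bridge" as in the text above.

```c
#include <stddef.h>

/* Hypothetical bridge port, identified by address. */
struct port {
    const char *name;
};

/* Hypothetical FDB entry; dst == NULL means it points to the bridge. */
struct fdb {
    const char *desc;
    const struct port *dst;
};

/* Count the entries replayed to a joining port.
 * entire == 0: previous behaviour, only entries towards the joining
 *              port or towards the bridge are replayed.
 * entire != 0: behaviour after this change, the whole FDB is replayed. */
static size_t fdb_replay_count(const struct fdb *fdb, size_t n,
                               const struct port *joining, int entire)
{
    size_t replayed = 0;

    for (size_t i = 0; i < n; i++) {
        if (entire || fdb[i].dst == joining || fdb[i].dst == NULL)
            replayed++;
    }
    return replayed;
}
```

With an FDB holding the br0 MAC inherited from swp0 (`dst == swp0`) and an address learned on a foreign interface, a joining swp1 sees neither entry under the old filter and both when the entire FDB is replayed.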

A natural question arises: say there is an FDB entry to be replayed,
like a MAC address dynamically learned on a foreign interface that
belongs to a bridge where no switchdev port has joined yet. If 10
switchdev ports belonging to the same driver join this bridge, one by
one, won't every port get notified 10 times of the foreign FDB entry,
amounting to a total of 100 notifications for this FDB entry in the
switchdev driver?

Well, yes, but this is where the "void *ctx" argument for br_fdb_replay
is useful: every port of the switchdev driver is notified whenever any
other port requests an FDB replay, but because the replay was initiated
by a different port, its context is different from the initiating port's
context, so it ignores those replays.

So the foreign FDB entry will be installed only 10 times, once per port.
This is done so that the following 4 code paths are always well balanced:
(a) addition of foreign FDB entry is replayed when port joins bridge
(b) deletion of foreign FDB entry is replayed when port leaves bridge
(c) addition of foreign FDB entry is notified to all ports currently in bridge
(d) deletion of foreign FDB entry is notified to all ports currently in bridge
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4454bc6
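The counting argument about the "void *ctx" argument (100 notifications, but only 10 installs) can be checked with a short model. This is a sketch under assumptions: `struct sw_port` and `replay_to_driver` are hypothetical, and the driver is modelled as delivering every replayed event to each of its ports, with a port acting only when the context is its own.

```c
#include <stddef.h>

/* Hypothetical per-port state: how many times the entry was installed. */
struct sw_port {
    unsigned int installs;
};

/* Deliver one replayed FDB event to every port of the driver. Only the
 * port whose address matches ctx (the port that requested the replay)
 * installs the entry; the others ignore it. Returns the number of
 * deliveries made. */
static unsigned int replay_to_driver(struct sw_port *ports,
                                     unsigned int nports, const void *ctx)
{
    unsigned int delivered = 0;

    for (unsigned int i = 0; i < nports; i++) {
        delivered++;
        if (ctx == (const void *)&ports[i])
            ports[i].installs++;
    }
    return delivered;
}
```

Letting 10 ports join one by one, each requesting a replay with its own context, gives 100 deliveries in total while every port installs the foreign entry exactly once.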