Commits · 5601ef91fba8efea483a7ca3bd6c42a581e4daa1 · Kirill Smelkov / linux

02 Apr, 2023 3 commits

mlxsw: core_thermal: Use static trip points for transceiver modules · 5601ef91

Ido Schimmel authored Mar 31, 2023

The driver registers a thermal zone for each transceiver module and
tries to set the trip point temperatures according to the thresholds
read from the transceiver. If a threshold cannot be read or if a
transceiver is unplugged, the trip point temperature is set to zero,
which means that it is disabled as far as the thermal subsystem is
concerned.

A recent change in the thermal core made it so that such trip points are
no longer marked as disabled, which lead the thermal subsystem to
incorrectly set the associated cooling devices to the their maximum
state [1]. A fix to restore this behavior was merged in commit
f1b80a38 ("thermal: core: Restore behavior regarding invalid trip
points"). However, the thermal maintainer suggested to not rely on this
behavior and instead always register a valid array of trip points [2].

Therefore, create a static array of trip points with sane defaults
(suggested by Vadim) and register it with the thermal zone of each
transceiver module. User space can choose to override these defaults
using the thermal zone sysfs interface since these files are writeable.

Before:

 $ cat /sys/class/thermal/thermal_zone11/type
 mlxsw-module11
 $ cat /sys/class/thermal/thermal_zone11/trip_point_*_temp
 65000
 75000
 80000

After:

 $ cat /sys/class/thermal/thermal_zone11/type
 mlxsw-module11
 $ cat /sys/class/thermal/thermal_zone11/trip_point_*_temp
 55000
 65000
 80000

Also tested by reverting commit f1b80a38 ("thermal: core: Restore
behavior regarding invalid trip points") and making sure that the
associated cooling devices are not set to their maximum state.

[1] https://lore.kernel.org/linux-pm/ZA3CFNhU4AbtsP4G@shredder/
[2] https://lore.kernel.org/linux-pm/f78e6b70-a963-c0ca-a4b2-0d4c6aeef1fb@linaro.org/Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5601ef91

net: minor reshuffle of napi_struct · dd2d6604

Jakub Kicinski authored Mar 30, 2023

napi_id is read by GRO and drivers to mark skbs, and it currently
sits at the end of the structure, in a mostly unused cache line.
Move it up into a hole, and separate the clearly control path
fields from the important ones.

Before:

struct napi_struct {
	struct list_head           poll_list;            /*     0    16 */
	long unsigned int          state;                /*    16     8 */
	int                        weight;               /*    24     4 */
	int                        defer_hard_irqs_count; /*    28     4 */
	long unsigned int          gro_bitmask;          /*    32     8 */
	int                        (*poll)(struct napi_struct *, int); /*    40     8 */
	int                        poll_owner;           /*    48     4 */

	/* XXX 4 bytes hole, try to pack */

	struct net_device *        dev;                  /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct gro_list            gro_hash[8];          /*    64   192 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	struct sk_buff *           skb;                  /*   256     8 */
	struct list_head           rx_list;              /*   264    16 */
	int                        rx_count;             /*   280     4 */

	/* XXX 4 bytes hole, try to pack */

	struct hrtimer             timer;                /*   288    64 */

	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */
	struct list_head           dev_list;             /*   352    16 */
	struct hlist_node          napi_hash_node;       /*   368    16 */
	/* --- cacheline 6 boundary (384 bytes) --- */
	unsigned int               napi_id;              /*   384     4 */

	/* XXX 4 bytes hole, try to pack */

	struct task_struct *       thread;               /*   392     8 */

	/* size: 400, cachelines: 7, members: 17 */
	/* sum members: 388, holes: 3, sum holes: 12 */
	/* paddings: 1, sum paddings: 4 */
	/* last cacheline: 16 bytes */
};

After:

struct napi_struct {
	struct list_head           poll_list;            /*     0    16 */
	long unsigned int          state;                /*    16     8 */
	int                        weight;               /*    24     4 */
	int                        defer_hard_irqs_count; /*    28     4 */
	long unsigned int          gro_bitmask;          /*    32     8 */
	int                        (*poll)(struct napi_struct *, int); /*    40     8 */
	int                        poll_owner;           /*    48     4 */

	/* XXX 4 bytes hole, try to pack */

	struct net_device *        dev;                  /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct gro_list            gro_hash[8];          /*    64   192 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	struct sk_buff *           skb;                  /*   256     8 */
	struct list_head           rx_list;              /*   264    16 */
	int                        rx_count;             /*   280     4 */
	unsigned int               napi_id;              /*   284     4 */
	struct hrtimer             timer;                /*   288    64 */

	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */
	struct task_struct *       thread;               /*   352     8 */
	struct list_head           dev_list;             /*   360    16 */
	struct hlist_node          napi_hash_node;       /*   376    16 */

	/* size: 392, cachelines: 7, members: 17 */
	/* sum members: 388, holes: 1, sum holes: 4 */
	/* paddings: 1, sum paddings: 4 */
	/* forced alignments: 1 */
	/* last cacheline: 8 bytes */
} __attribute__((__aligned__(8)));
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dd2d6604

i40e: Add support for VF to specify its primary MAC address · ceb29474

Sylwester Dziedziuch authored Mar 30, 2023

Currently in the i40e driver there is no implementation of different
MAC address handling depending on whether it is a legacy or primary.
Introduce new checks for VF to be able to specify its primary MAC
address based on the VIRTCHNL_ETHER_ADDR_PRIMARY type.

Primary MAC address are treated differently compared to legacy
ones in a scenario where:
1. If a unicast MAC is being added and it's specified as
VIRTCHNL_ETHER_ADDR_PRIMARY, then replace the current
default_lan_addr.addr.
2. If a unicast MAC is being deleted and it's type
is specified as VIRTCHNL_ETHER_ADDR_PRIMARY, then zero the
hw_lan_addr.addr.
Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ceb29474

01 Apr, 2023 1 commit

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · d74aab2c

Jakub Kicinski authored Mar 31, 2023

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2023-03-30 (documentation, ice)

This series contains updates to driver documentation and the ice driver.

Tony removes links and addresses related to the out-of-tree driver from the
Intel ethernet driver documentation.

Jake removes a comment that is no longer valid to the ice driver.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: remove comment about not supporting driver reinit
  Documentation/eth/intel: Remove references to SourceForge
  Documentation/eth/intel: Update address for driver support
====================

Link: https://lore.kernel.org/r/20230330165935.2503604-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d74aab2c

31 Mar, 2023 20 commits

Merge tag 'nf-next-2023-03-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next · 54fd494a

Jakub Kicinski authored Mar 31, 2023

Florian Westphal says:

====================
netfilter updates for net-next

1. No need to disable BH in nfnetlink proc handler, freeing happens
   via call_rcu.
2. Expose classid in nfetlink_queue, from Eric Sage.
3. Fix nfnetlink message description comments, from Matthieu De Beule.
4. Allow removal of offloaded connections via ctnetlink, from Paul Blakey.

* tag 'nf-next-2023-03-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: ctnetlink: Support offloaded conntrack entry deletion
  netfilter: Correct documentation errors in nf_tables.h
  netfilter: nfnetlink_queue: enable classid socket info retrieval
  netfilter: nfnetlink_log: remove rcu_bh usage
====================

Link: https://lore.kernel.org/r/20230331104809.2959-1-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

54fd494a

dt-bindings: net: fec: add power-domains property · 99b3a769

Peng Fan authored Mar 28, 2023

Add optional power domains property
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230328061518.1985981-1-peng.fan@oss.nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

99b3a769

tcp: Refine SYN handling for PAWS. · ee05d90d

Kuniyuki Iwashima authored Mar 29, 2023

Our Network Load Balancer (NLB) [0] has multiple nodes with different
IP addresses, and each node forwards TCP flows from clients to backend
targets. NLB has an option to preserve the client's source IP address
and port when routing packets to backend targets. [1]

When a client connects to two different NLB nodes, they may select the
same backend target. Then, if the client has used the same source IP
and port, the two flows at the backend side will have the same 4-tuple.

While testing around such cases, I saw these sequences on the backend
target.

IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2819965599, win 62727, options [mss 8365,sackOK,TS val 1029816180 ecr 0,nop,wscale 7], length 0
IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 3040695044, ack 2819965600, win 62643, options [mss 8961,sackOK,TS val 1224784076 ecr 1029816180,nop,wscale 7], length 0
IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 1, win 491, options [nop,nop,TS val 1029816181 ecr 1224784076], length 0
IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2681819307, win 62727, options [mss 8365,sackOK,TS val 572088282 ecr 0,nop,wscale 7], length 0
IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [.], ack 1, win 490, options [nop,nop,TS val 1224794914 ecr 1029816181,nop,nop,sack 1 {4156821004:4156821005}], length 0

It seems to be working correctly, but the last ACK was generated by
tcp_send_dupack() and PAWSEstab was increased. This is because the
second connection has a smaller timestamp than the first one.

In this case, we should send a dup ACK in tcp_send_challenge_ack()
to increase the correct counter and rate-limit it properly.

Let's check the SYN flag after the PAWS tests to avoid adding unnecessary
overhead for most packets.

Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html [0]
Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation [1]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee05d90d

macvlan: Fix mc_filter calculation · ae63ad9b

Herbert Xu authored Mar 29, 2023

On Wed, Mar 29, 2023 at 08:10:26AM +0000, patchwork-bot+netdevbpf@kernel.org wrote:
>
> Here is the summary with links:
>   - [1/2] macvlan: Skip broadcast queue if multicast with single receiver
>     https://git.kernel.org/netdev/net-next/c/d45276e75e90
>   - [2/2] macvlan: Add netlink attribute for broadcast cutoff
>     https://git.kernel.org/netdev/net-next/c/954d1fa1ac93

Sorry, I made an error and posted my patches from an earlier
revision so a follow-up fix was missing:

---8<---
The bc_cutoff patch broke the calculation of mc_filter causing
some multicast packets to not make it through to the targeted
device.

Fix this by checking whether vlan is set instead of cutoff >= 0.

Also move the cutoff < 0 logic into macvlan_recompute_bc_filter
so that it doesn't change the mc_filter at all.

Fixes: d45276e7 ("macvlan: Skip broadcast queue if multicast with single receiver")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae63ad9b

Merge tag 'wireless-next-2023-03-30' of... · ce7928f7

Jakub Kicinski authored Mar 30, 2023

Merge tag 'wireless-next-2023-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Major stack changes:

 * TC offload support for drivers below mac80211
 * reduced neighbor report (RNR) handling for AP mode
 * mac80211 mesh fast-xmit and fast-rx support
 * support for another mesh A-MSDU format
   (seems nobody got the spec right)

Major driver changes:

Kalle moved the drivers that were just plain C files
in drivers/net/wireless/ to legacy/ and virtual/ dirs.

hwsim
 * multi-BSSID support
 * some FTM support

ath11k
 * MU-MIMO parameters support
 * ack signal support for management packets

rtl8xxxu
 * support for RTL8710BU aka RTL8188GU chips

rtw89
 * support for various newer firmware APIs

ath10k
 * enabled threaded NAPI on WCN3990

iwlwifi
 * lots of work for multi-link/EHT (wifi7)
 * hardware timestamping support for some devices/firwmares
 * TX beacon protection on newer hardware

* tag 'wireless-next-2023-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (181 commits)
  wifi: clean up erroneously introduced file
  wifi: iwlwifi: mvm: correctly use link in iwl_mvm_sta_del()
  wifi: iwlwifi: separate AP link management queues
  wifi: iwlwifi: mvm: free probe_resp_data later
  wifi: iwlwifi: bump FW API to 75 for AX devices
  wifi: iwlwifi: mvm: move max_agg_bufsize into host TLC lq_sta
  wifi: iwlwifi: mvm: send full STA during HW restart
  wifi: iwlwifi: mvm: rework active links counting
  wifi: iwlwifi: mvm: update mac config when assigning chanctx
  wifi: iwlwifi: mvm: use the correct link queue
  wifi: iwlwifi: mvm: clean up mac_id vs. link_id in MLD sta
  wifi: iwlwifi: mvm: fix station link data leak
  wifi: iwlwifi: mvm: initialize max_rc_amsdu_len per-link
  wifi: iwlwifi: mvm: use appropriate link for rate selection
  wifi: iwlwifi: mvm: use the new lockdep-checking macros
  wifi: iwlwifi: mvm: remove chanctx WARN_ON
  wifi: iwlwifi: mvm: avoid sending MAC context for idle
  wifi: iwlwifi: mvm: remove only link-specific AP keys
  wifi: iwlwifi: mvm: skip inactive links
  wifi: iwlwifi: mvm: adjust iwl_mvm_scan_respect_p2p_go_iter() for MLO
  ...
====================

Link: https://lore.kernel.org/r/20230330205612.921134-1-johannes@sipsolutions.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ce7928f7

Merge branch 'tools-ynl-fill-in-some-gaps-of-ethtool-spec' · dee1efb3

Jakub Kicinski authored Mar 30, 2023

Stanislav Fomichev says:

====================
tools: ynl: fill in some gaps of ethtool spec

I was trying to fill in the spec while exploring ethtool API for some
related work. I don't think I'll have the patience to fill in the rest,
so decided to share whatever I currently have.

Patches 1-2 add the be16 + spec.
Patches 3-4 implement an ethtool-like python tool to test the spec.

Patches 3-4 are there because it felt more fun do the tool instead
of writing the actual tests; feel free to drop it; sharing mostly
to show that the spec is not a complete nonsense.

The spec is not 100% complete, see patch 2 for what's missing.
I was hoping to finish the stats-get message, but I'm too dump
to implement bitmask marshaling (multi-attr).
====================

Link: https://lore.kernel.org/r/20230329221655.708489-1-sdf@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

dee1efb3

tools: ynl: ethtool testing tool · f3d07b02

Stanislav Fomichev authored Mar 29, 2023

This is what I've been using to see whether the spec makes sense.
A small subset of getters (mostly the unprivileged ones) is implemented.
Some setters (channels) also work.
Setters for messages with bitmasks are not implemented.

Initially I was trying to make this tool look 1:1 like real ethtool,
but eventually gave up :-)

Sample output:

$ ./tools/net/ynl/ethtool enp0s31f6
Settings for enp0s31f6:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half
100baseT/Full 1000baseT/Full
Supported pause frame use: no
Supports auto-negotiation: yes
Supported FEC modes: Not reported
Speed: Unknown!
Duplex: Unknown! (255)
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 2
Transceiver: Internal
MDI-X: Unknown (auto)
Current message level: drv probe link
Link detected: no
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f3d07b02

tools: ynl: replace print with NlError · 48993e22

Stanislav Fomichev authored Mar 29, 2023

Instead of dumping the error on the stdout, make the callee and
opportunity to decide what to do with it. This is mostly for the
ethtool testing.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

48993e22

tools: ynl: populate most of the ethtool spec · a353318e

Stanislav Fomichev authored Mar 29, 2023

Things that are not implemented:
- cable tests
- bitmaks in the requests don't work (needs multi-attr support in ynl.py)
- stats-get seems to return nonsense (not passing a bitmask properly?)
- notifications are not tested
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a353318e

tools: ynl: support byte-order in cli · 9f7cc57f

Stanislav Fomichev authored Mar 29, 2023

Used by ethtool spec.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9f7cc57f

octeontx2-af: update type of prof fields in nix_aw_enq_req · 709d0b88

Simon Horman authored Mar 29, 2023

Update type of prof and prof_mask fields in nix_as_enq_req
from u64 to struct nix_bandprof_s, which is 128 bits wide.

This is to address warnings with compiling with gcc-12 W=1
regarding string fortification.

Although the union of which these fields are a member is 128bits
wide, and thus writing a 128bit entity is safe, the compiler flags
a problem as the field being written is only 64 bits wide.

  CC [M]  drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.o
scripts/Makefile.build:252: ./drivers/net/ethernet/marvell/octeontx2/nic/Makefile: otx2_dcbnl.o is added to multiple modules: rvu_nicpf rvu_nicvf
  CC [M]  drivers/net/ethernet/marvell/octeontx2/nic/otx2_dcbnl.o
  CC [M]  drivers/net/ethernet/marvell/octeontx2/nic/qos_sq.o
  CC [M]  drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.o
  CC [M]  drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.o
In file included from ./include/linux/string.h:254,
                 from ./include/linux/bitmap.h:11,
                 from ./include/linux/cpumask.h:12,
                 from ./arch/x86/include/asm/paravirt.h:17,
                 from ./arch/x86/include/asm/cpuid.h:62,
                 from ./arch/x86/include/asm/processor.h:19,
                 from ./arch/x86/include/asm/timex.h:5,
                 from ./include/linux/timex.h:67,
                 from ./include/linux/time32.h:13,
                 from ./include/linux/time.h:60,
                 from ./include/linux/stat.h:19,
                 from ./include/linux/module.h:13,
                 from drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c:8:
In function 'fortify_memcpy_chk',
    inlined from 'rvu_nix_blk_aq_enq_inst' at drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c:969:4:
./include/linux/fortify-string.h:529:25: error: call to '__read_overflow2_field' declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
  529 |                         __read_overflow2_field(q_size_field, size);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function 'fortify_memcpy_chk',
    inlined from 'rvu_nix_blk_aq_enq_inst' at drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c:984:4:
./include/linux/fortify-string.h:529:25: error: call to '__read_overflow2_field' declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
  529 |                         __read_overflow2_field(q_size_field, size);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

Compile tested only!
Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230329112356.458072-1-horms@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

709d0b88

Merge branch 'net-sched-act_tunnel_key-add-support-for-tunnel_dont_fragment' · f76b9bba

Jakub Kicinski authored Mar 30, 2023

Davide Caratti says:

====================
net/sched: act_tunnel_key: add support for TUNNEL_DONT_FRAGMENT

- patch 1 extends TC tunnel_key action to add support for TUNNEL_DONT_FRAGMENT
- patch 2 extends tdc to skip tests when iproute2 support is missing
- patch 3 adds a tdc test case to verify functionality of the control plane
- patch 4 adds a net/forwarding test case to verify functionality of the data plane
====================

Link: https://lore.kernel.org/r/cover.1680082990.git.dcaratti@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f76b9bba

selftests: forwarding: add tunnel_key "nofrag" test case · 533a89b1

Davide Caratti authored Mar 29, 2023

Add a selftest that configures metadata tunnel encapsulation using the TC
"tunnel_key" action: it includes a test case for setting "nofrag" flag.

Example output:

 # selftests: net/forwarding: tc_tunnel_key.sh
 # TEST: tunnel_key nofrag (skip_hw)                                   [ OK ]
 # INFO: Could not test offloaded functionality
 ok 1 selftests: net/forwarding: tc_tunnel_key.sh
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

533a89b1

selftests: tc-testing: add tunnel_key "nofrag" test case · b8617f8e

Davide Caratti authored Mar 29, 2023

# ./tdc.py -e 6bda -l
 6bda: (actions, tunnel_key) Add tunnel_key action with nofrag option
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b8617f8e

selftests: tc-testing: add "depends_on" property to skip tests · 7f3f8640

Davide Caratti authored Mar 29, 2023

currently, users can skip individual test cases by means of writing

  "skip": "yes"

in the scenario file. Extend this functionality, introducing 'dependsOn':
it's optional property like "skip", but the value contains a command (for
example, a probe on iproute2 to check if it supports a specific feature).
If such property is present, tdc executes that command and skips the test
when the return value is non-zero.
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7f3f8640

net/sched: act_tunnel_key: add support for "don't fragment" · 2384127e

Davide Caratti authored Mar 29, 2023

extend "act_tunnel_key" to allow specifying TUNNEL_DONT_FRAGMENT.
Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2384127e

selftests: rtnetlink: Fix do_test_address_proto() · 46e9acb7

Petr Machata authored Mar 29, 2023

This selftest was introduced recently in the commit cited below. It misses
several check_err() invocations to actually verify that the previous
command succeeded. When these are added, the first one fails, because
besides the addresses added by hand, there can be a link-local address
added by the kernel. Adjust the check to expect at least three addresses
instead of exactly three, and add the missing check_err's.

Furthermore, the explanatory comments assume that the address with no
protocol is $addr2, when in fact it is $addr3. Update the comments.

Fixes: 6a414fd7 ("selftests: rtnetlink: Add an address proto test")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/53a579bc883e1bf2fe490d58427cf22c2d1aa21f.1680102695.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

46e9acb7

net: ethernet: ti: Fix format specifier in netcp_create_interface() · 3292004c

Nathan Chancellor authored Mar 29, 2023

After commit 3948b059 ("net: introduce a config option to tweak
MAX_SKB_FRAGS"), clang warns:

  drivers/net/ethernet/ti/netcp_core.c:2085:4: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
                          MAX_SKB_FRAGS);
                          ^~~~~~~~~~~~~
  include/linux/dev_printk.h:144:65: note: expanded from macro 'dev_err'
          dev_printk_index_wrap(_dev_err, KERN_ERR, dev, dev_fmt(fmt), ##__VA_ARGS__)
                                                                 ~~~     ^~~~~~~~~~~
  include/linux/dev_printk.h:110:23: note: expanded from macro 'dev_printk_index_wrap'
                  _p_func(dev, fmt, ##__VA_ARGS__);                       \
                               ~~~    ^~~~~~~~~~~
  include/linux/skbuff.h:352:23: note: expanded from macro 'MAX_SKB_FRAGS'
  #define MAX_SKB_FRAGS CONFIG_MAX_SKB_FRAGS
                        ^~~~~~~~~~~~~~~~~~~~
  ./include/generated/autoconf.h:11789:30: note: expanded from macro 'CONFIG_MAX_SKB_FRAGS'
  #define CONFIG_MAX_SKB_FRAGS 17
                               ^~
  1 warning generated.

Follow the pattern of the rest of the tree by changing the specifier to
'%u' and casting MAX_SKB_FRAGS explicitly to 'unsigned int', which
eliminates the warning.

Fixes: 3948b059 ("net: introduce a config option to tweak MAX_SKB_FRAGS")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230329-net-ethernet-ti-wformat-v1-1-83d0f799b553@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3292004c

net: dsa: fix db type confusion in host fdb/mdb add/del · eb1ab765

Vladimir Oltean authored Mar 29, 2023

We have the following code paths:

Host FDB (unicast RX filtering):

dsa_port_standalone_host_fdb_add()   dsa_port_bridge_host_fdb_add()
               |                                     |
               +--------------+         +------------+
                              |         |
                              v         v
                         dsa_port_host_fdb_add()

dsa_port_standalone_host_fdb_del()   dsa_port_bridge_host_fdb_del()
               |                                     |
               +--------------+         +------------+
                              |         |
                              v         v
                         dsa_port_host_fdb_del()

Host MDB (multicast RX filtering):

dsa_port_standalone_host_mdb_add()   dsa_port_bridge_host_mdb_add()
               |                                     |
               +--------------+         +------------+
                              |         |
                              v         v
                         dsa_port_host_mdb_add()

dsa_port_standalone_host_mdb_del()   dsa_port_bridge_host_mdb_del()
               |                                     |
               +--------------+         +------------+
                              |         |
                              v         v
                         dsa_port_host_mdb_del()

The logic added by commit 5e8a1e03 ("net: dsa: install secondary
unicast and multicast addresses as host FDB/MDB") zeroes out
db.bridge.num if the switch doesn't support ds->fdb_isolation
(the majority doesn't). This is done for a reason explained in commit
c2693363 ("net: dsa: request drivers to perform FDB isolation").

Taking a single code path as example - dsa_port_host_fdb_add() - the
others are similar - the problem is that this function handles:
- DSA_DB_PORT databases, when called from
  dsa_port_standalone_host_fdb_add()
- DSA_DB_BRIDGE databases, when called from
  dsa_port_bridge_host_fdb_add()

So, if dsa_port_host_fdb_add() were to make any change on the
"bridge.num" attribute of the database, this would only be correct for a
DSA_DB_BRIDGE, and a type confusion for a DSA_DB_PORT bridge.

However, this bug is without consequences, for 2 reasons:

- dsa_port_standalone_host_fdb_add() is only called from code which is
  (in)directly guarded by dsa_switch_supports_uc_filtering(ds), and that
  function only returns true if ds->fdb_isolation is set. So, the code
  only executed for DSA_DB_BRIDGE databases.

- Even if the code was not dead for DSA_DB_PORT, we have the following
  memory layout:

struct dsa_bridge {
	struct net_device *dev;
	unsigned int num;
	bool tx_fwd_offload;
	refcount_t refcount;
};

struct dsa_db {
	enum dsa_db_type type;

	union {
		const struct dsa_port *dp; // DSA_DB_PORT
		struct dsa_lag lag;
		struct dsa_bridge bridge; // DSA_DB_BRIDGE
	};
};

So, the zeroization of dsa_db :: bridge :: num on a dsa_db structure of
type DSA_DB_PORT would access memory which is unused, because we only
use dsa_db :: dp for DSA_DB_PORT, and this is mapped at the same address
with dsa_db :: dev for DSA_DB_BRIDGE, thanks to the union definition.

It is correct to fix up dsa_db :: bridge :: num only from code paths
that come from the bridge / switchdev, so move these there.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230329133819.697642-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

eb1ab765

net: ksz884x: remove unused change variable · 9a865a98

Tom Rix authored Mar 29, 2023

clang with W=1 reports
drivers/net/ethernet/micrel/ksz884x.c:3216:6: error: variable
  'change' set but not used [-Werror,-Wunused-but-set-variable]
        int change = 0;
            ^
This variable is not used so remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230329125929.1808420-1-trix@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9a865a98

30 Mar, 2023 16 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 79548b79

Jakub Kicinski authored Mar 30, 2023

Conflicts:

drivers/net/ethernet/mediatek/mtk_ppe.c
  3fbe4d8c ("net: ethernet: mtk_eth_soc: ppe: add support for flow accounting")
  92453132 ("net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

79548b79

Merge tag 'net-6.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · b2bc47e9

Linus Torvalds authored Mar 30, 2023

Pull networking fixes from Jakub Kicinski:
 "Including fixes from CAN and WPAN.

  Still quite a few bugs from this release. This pull is a bit smaller
  because major subtrees went into the previous one. Or maybe people
  took spring break off?

  Current release - regressions:

   - phy: micrel: correct KSZ9131RNX EEE capabilities and advertisement

  Current release - new code bugs:

   - eth: wangxun: fix vector length of interrupt cause

   - vsock/loopback: consistently protect the packet queue with
     sk_buff_head.lock

   - virtio/vsock: fix header length on skb merging

   - wpan: ca8210: fix unsigned mac_len comparison with zero

  Previous releases - regressions:

   - eth: stmmac: don't reject VLANs when IFF_PROMISC is set

   - eth: smsc911x: avoid PHY being resumed when interface is not up

   - eth: mtk_eth_soc: fix tx throughput regression with direct 1G links

   - eth: bnx2x: use the right build_skb() helper after core rework

   - wwan: iosm: fix 7560 modem crash on use on unsupported channel

  Previous releases - always broken:

   - eth: sfc: don't overwrite offload features at NIC reset

   - eth: r8169: fix RTL8168H and RTL8107E rx crc error

   - can: j1939: prevent deadlock by moving j1939_sk_errqueue()

   - virt: vmxnet3: use GRO callback when UPT is enabled

   - virt: xen: don't do grant copy across page boundary

   - phy: dp83869: fix default value for tx-/rx-internal-delay

   - dsa: ksz8: fix multiple issues with ksz8_fdb_dump

   - eth: mvpp2: fix classification/RSS of VLAN and fragmented packets

   - eth: mtk_eth_soc: fix flow block refcounting logic

  Misc:

   - constify fwnode pointers in SFP handling"

* tag 'net-6.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (55 commits)
  net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow
  net: ethernet: mtk_eth_soc: fix L2 offloading with DSA untag offload
  net: ethernet: mtk_eth_soc: fix flow block refcounting logic
  net: mvneta: fix potential double-frees in mvneta_txq_sw_deinit()
  net: dsa: sync unicast and multicast addresses for VLAN filters too
  net: dsa: mv88e6xxx: Enable IGMP snooping on user ports only
  xen/netback: use same error messages for same errors
  test/vsock: new skbuff appending test
  virtio/vsock: WARN_ONCE() for invalid state of socket
  virtio/vsock: fix header length on skb merging
  bnxt_en: Add missing 200G link speed reporting
  bnxt_en: Fix typo in PCI id to device description string mapping
  bnxt_en: Fix reporting of test result in ethtool selftest
  i40e: fix registers dump after run ethtool adapter self test
  bnx2x: use the right build_skb() helper
  net: ipa: compute DMA pool size properly
  net: wwan: iosm: fixes 7560 modem crash
  net: ethernet: mtk_eth_soc: fix tx throughput regression with direct 1G links
  ice: fix invalid check for empty list in ice_sched_assoc_vsi_to_agg()
  ice: add profile conflict check for AVF FDIR
  ...

b2bc47e9

Merge tag 'for-6.3/dm-fixes-2' of... · b527ac44

Linus Torvalds authored Mar 30, 2023

Merge tag 'for-6.3/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix two DM core bugs in the code that handles splitting "abnormal" IO
   (discards, write same and secure erase) and issuing that IO to the
   correct underlying devices (and offsets within those devices).

* tag 'for-6.3/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix __send_duplicate_bios() to always allow for splitting IO
  dm: fix improper splitting for abnormal bios

b527ac44

wifi: clean up erroneously introduced file · aa2aa818

Johannes Berg authored Mar 30, 2023

Evidently Gregory sent this file but I (apparently every else) missed
it entirely, remove that.

Fixes: cf85123a ("wifi: iwlwifi: mvm: support enabling and disabling HW timestamping")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

aa2aa818

Merge tag 'drm-fixes-2023-03-30' of git://anongit.freedesktop.org/drm/drm · 0d3ff808

Linus Torvalds authored Mar 30, 2023

Pull drm fixes from Daniel Vetter:
 "Two regression fixes in here, otherwise just the usual stuff:

   - i915 fixes for color mgmt, psr, lmem flush, hibernate oops, and
     more

   - amdgpu: dp mst and hibernate regression fix

   - etnaviv: revert fdinfo support (incl drm/sched revert), leak fix

   - misc ivpu fixes, nouveau backlight, drm buddy allocator 32bit
     fixes"

* tag 'drm-fixes-2023-03-30' of git://anongit.freedesktop.org/drm/drm: (27 commits)
  Revert "drm/scheduler: track GPU active time per entity"
  Revert "drm/etnaviv: export client GPU usage statistics via fdinfo"
  drm/etnaviv: fix reference leak when mmaping imported buffer
  drm/amdgpu: allow more APUs to do mode2 reset when go to S4
  drm/amd/display: Take FEC Overhead into Timeslot Calculation
  drm/amd/display: Add DSC Support for Synaptics Cascaded MST Hub
  drm: test: Fix 32-bit issue in drm_buddy_test
  drm: buddy_allocator: Fix buddy allocator init on 32-bit systems
  drm/nouveau/kms: Fix backlight registration
  drm/i915/perf: Drop wakeref on GuC RC error
  drm/i915/dpt: Treat the DPT BO as a framebuffer
  drm/i915/gem: Flush lmem contents after construction
  drm/i915/tc: Fix the ICL PHY ownership check in TC-cold state
  drm/i915: Disable DC states for all commits
  drm/i915: Workaround ICL CSC_MODE sticky arming
  drm/i915: Add a .color_post_update() hook
  drm/i915: Move CSC load back into .color_commit_arm() when PSR is enabled on skl/glk
  drm/i915: Split icl_color_commit_noarm() from skl_color_commit_noarm()
  drm/i915/pmu: Use functions common with sysfs to read actual freq
  accel/ivpu: Fix IPC buffer header status field value
  ...

0d3ff808

netfilter: ctnetlink: Support offloaded conntrack entry deletion · 9b7c68b3

Paul Blakey authored Mar 22, 2023

Currently, offloaded conntrack entries (flows) can only be deleted
after they are removed from offload, which is either by timeout,
tcp state change or tc ct rule deletion. This can cause issues for
users wishing to manually delete or flush existing entries.

Support deletion of offloaded conntrack entries.

Example usage:
 # Delete all offloaded (and non offloaded) conntrack entries
 # whose source address is 1.2.3.4
 $ conntrack -D -s 1.2.3.4
 # Delete all entries
 $ conntrack -F
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>

9b7c68b3

netfilter: Correct documentation errors in nf_tables.h · a25b8b71

Matthieu De Beule authored Mar 29, 2023

NFTA_RANGE_OP incorrectly says nft_cmp_ops instead of nft_range_ops.
NFTA_LOG_GROUP and NFTA_LOG_QTHRESHOLD claim NLA_U32 instead of NLA_U16
NFTA_EXTHDR_SREG isn't documented as a register
Signed-off-by: Matthieu De Beule <matthieu.debeule@proton.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>

a25b8b71

netfilter: nfnetlink_queue: enable classid socket info retrieval · 28c1b6df

Eric Sage authored Mar 27, 2023

This enables associating a socket with a v1 net_cls cgroup. Useful for
applying a per-cgroup policy when processing packets in userspace.
Signed-off-by: Eric Sage <eric_sage@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>

28c1b6df

netfilter: nfnetlink_log: remove rcu_bh usage · 356e2adb

Florian Westphal authored Mar 28, 2023

structure is free'd via call_rcu, so its safe to use rcu_read_lock only.

While at it, skip rcu_read_lock for lookup from packet path, its always
called with rcu held.
Signed-off-by: Florian Westphal <fw@strlen.de>

356e2adb

dm: fix __send_duplicate_bios() to always allow for splitting IO · 666eed46

Mike Snitzer authored Mar 30, 2023

Commit 7dd76d1f ("dm: improve bio splitting and associated IO
accounting") only called setup_split_accounting() from
__send_duplicate_bios() if a single bio were being issued. But the case
where duplicate bios are issued must call it too.

Otherwise the bio won't be split and resubmitted (via recursion through
block core back to DM) to submit the later portions of a bio (which may
map to an entirely different target).

For example, when discarding an entire DM striped device with the
following DM table:
 vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
 vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048

Before (broken, discards the first striped target's devices twice):
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2049 len=22528
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=22528

After (works as expected):
 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872
 device-mapper: striped: target_stripe=0, bdev=7:2, start=2048 len=22528
 device-mapper: striped: target_stripe=1, bdev=7:3, start=2048 len=22528

Fixes: 7dd76d1f ("dm: improve bio splitting and associated IO accounting")
Cc: stable@vger.kernel.org
Reported-by: Orange Kao <orange@aiven.io>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

666eed46

dm: fix improper splitting for abnormal bios · f7b58a69

Mike Snitzer authored Mar 30, 2023

"Abnormal" bios include discards, write zeroes and secure erase. By no
longer passing the calculated 'len' pointer, commit 7dd06a25 ("dm:
allow dm_accept_partial_bio() for dm_io without duplicate bios") took a
senseless approach to disallowing dm_accept_partial_bio() from working
for duplicate bios processed using __send_duplicate_bios().

It inadvertently and incorrectly stopped the use of 'len' when
initializing a target's io (in alloc_tio). As such the resulting tio
could address more area of a device than it should.

For example, when discarding an entire DM striped device with the
following DM table:
 vg-lvol0: 0 159744 striped 2 128 7:0 2048 7:1 2048
 vg-lvol0: 159744 45056 striped 2 128 7:2 2048 7:3 2048

Before this fix:

 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=102400
 blkdiscard: attempt to access beyond end of device
 loop0: rw=2051, sector=2048, nr_sectors = 102400 limit=81920

 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=102400
 blkdiscard: attempt to access beyond end of device
 loop1: rw=2051, sector=2048, nr_sectors = 102400 limit=81920

After this fix;

 device-mapper: striped: target_stripe=0, bdev=7:0, start=2048 len=79872
 device-mapper: striped: target_stripe=1, bdev=7:1, start=2048 len=79872

Fixes: 7dd06a25 ("dm: allow dm_accept_partial_bio() for dm_io without duplicate bios")
Cc: stable@vger.kernel.org
Reported-by: Orange Kao <orange@aiven.io>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>

f7b58a69

net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow · 92453132

Felix Fietkau authored Mar 30, 2023

The cache needs to be flushed to ensure that the hardware stops offloading
the flow immediately.

Fixes: 33fc42de ("net: ethernet: mtk_eth_soc: support creating mac address based offload entries")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230330120840.52079-3-nbd@nbd.nameSigned-off-by: Jakub Kicinski <kuba@kernel.org>

92453132

net: ethernet: mtk_eth_soc: fix L2 offloading with DSA untag offload · 5f36ca1b

Felix Fietkau authored Mar 30, 2023

Check for skb metadata in order to detect the case where the DSA header
is not present.

Fixes: 2d7605a7 ("net: ethernet: mtk_eth_soc: enable hardware DSA untagging")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230330120840.52079-2-nbd@nbd.nameSigned-off-by: Jakub Kicinski <kuba@kernel.org>

5f36ca1b

net: ethernet: mtk_eth_soc: fix flow block refcounting logic · 8c1cb87c

Felix Fietkau authored Mar 30, 2023

Since we call flow_block_cb_decref on FLOW_BLOCK_UNBIND, we also need to
call flow_block_cb_incref for a newly allocated cb.
Also fix the accidentally inverted refcount check on unbind.

Fixes: 502e84e2 ("net: ethernet: mtk_eth_soc: add flow offloading support")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230330120840.52079-1-nbd@nbd.nameSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8c1cb87c

net: mvneta: fix potential double-frees in mvneta_txq_sw_deinit() · 2960a2d3

Russell King (Oracle) authored Mar 29, 2023

Reported on the Turris forum, mvneta provokes kernel warnings in the
architecture DMA mapping code when mvneta_setup_txqs() fails to
allocate memory. This happens because when mvneta_cleanup_txqs() is
called in the mvneta_stop() path, we leave pointers in the structure
that have been freed.

Then on mvneta_open(), we call mvneta_setup_txqs(), which starts
allocating memory. On memory allocation failure, mvneta_cleanup_txqs()
will walk all the queues freeing any non-NULL pointers - which includes
pointers that were previously freed in mvneta_stop().

Fix this by setting these pointers to NULL to prevent double-freeing
of the same memory.

Fixes: 2adb719d ("net: mvneta: Implement software TSO")
Link: https://forum.turris.cz/t/random-kernel-exceptions-on-hbl-tos-7-0/18865/8Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1phUe5-00EieL-7q@rmk-PC.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2960a2d3

net: dsa: sync unicast and multicast addresses for VLAN filters too · 64fdc5f3

Vladimir Oltean authored Mar 29, 2023

If certain conditions are met, DSA can install all necessary MAC
addresses on the CPU ports as FDB entries and disable flooding towards
the CPU (we call this RX filtering).

There is one corner case where this does not work.

ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up
ip link set swp0 master br0 && ip link set swp0 up
ip link add link swp0 name swp0.100 type vlan id 100
ip link set swp0.100 up && ip addr add 192.168.100.1/24 dev swp0.100

Traffic through swp0.100 is broken, because the bridge turns on VLAN
filtering in the swp0 port (causing RX packets to be classified to the
FDB database corresponding to the VID from their 802.1Q header), and
although the 8021q module does call dev_uc_add() towards the real
device, that API is VLAN-unaware, so it only contains the MAC address,
not the VID; and DSA's current implementation of ndo_set_rx_mode() is
only for VID 0 (corresponding to FDB entries which are installed in an
FDB database which is only hit when the port is VLAN-unaware).

It's interesting to understand why the bridge does not turn on
IFF_PROMISC for its swp0 bridge port, and it may appear at first glance
that this is a regression caused by the logic in commit 2796d0c6
("bridge: Automatically manage port promiscuous mode."). After all,
a bridge port needs to have IFF_PROMISC by its very nature - it needs to
receive and forward frames with a MAC DA different from the bridge
ports' MAC addresses.

While that may be true, when the bridge is VLAN-aware *and* it has a
single port, there is no real reason to enable promiscuity even if that
is an automatic port, with flooding and learning (there is nowhere for
packets to go except to the BR_FDB_LOCAL entries), and this is how the
corner case appears. Adding a second automatic interface to the bridge
would make swp0 promisc as well, and would mask the corner case.

Given the dev_uc_add() / ndo_set_rx_mode() API is what it is (it doesn't
pass a VLAN ID), the only way to address that problem is to install host
FDB entries for the cartesian product of RX filtering MAC addresses and
VLAN RX filters.

Fixes: 7569459a ("net: dsa: manage flooding on the CPU ports")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230329151821.745752-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

64fdc5f3