1. 04 Jul, 2023 7 commits
    • Merge branch 'dsa-ll-fixes' · 95c41842
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      dsa: Fix mangled link-local MAC DAs with SJA1105 DSA
      
      The SJA1105 hardware tagging protocol is weird and will put DSA
      information (source port, switch ID) in the MAC DA of the packets sent
      to the CPU, and then send some additional (meta) packets which contain
      the original bytes from the previous packet's MAC DA.
      
      The tagging protocol driver contains logic to handle this, but the meta
      frames are optional functionality, and there are configurations when
      they aren't received (no PTP RX timestamping). Thus, the MAC DA from
      packets sent to the stack is not correct in all cases.
      
      Also, during testing it was found that the MAC DA patching procedure was
      incorrect.
      
      The investigation comes as a result of this discussion with Paolo:
      https://lore.kernel.org/netdev/f494387c8d55d9b1d5a3e88beedeeb448f2e6cc3.camel@redhat.com/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: always enable the send_meta options · a372d66a
      Vladimir Oltean authored
      incl_srcpt has the limitation, mentioned in commit b4638af8 ("net:
      dsa: sja1105: always enable the INCL_SRCPT option"), that frames with a
      MAC DA of 01:80:c2:xx:yy:zz will be received as 01:80:c2:00:00:zz unless
      PTP RX timestamping is enabled.
      
      The incl_srcpt option was initially unconditionally enabled, then that
      changed with commit 42824463 ("net: dsa: sja1105: Limit use of
      incl_srcpt to bridge+vlan mode"), then again with b4638af8 ("net:
      dsa: sja1105: always enable the INCL_SRCPT option"). Bottom line is that
      it now needs to be always enabled, otherwise the driver does not have a
      reliable source of information regarding source_port and switch_id for
      link-local traffic (tag_8021q VLANs may be imprecise since now they
      identify an entire bridging domain when ports are not standalone).
      
      If we accept that PTP RX timestamping (and therefore, meta frame
      generation) is always enabled in hardware, then that limitation could be
      avoided and packets with any MAC DA can be properly received, because
      meta frames do contain the original bytes from the MAC DA of their
      associated link-local packet.
      
      This change enables meta frame generation unconditionally, which also
      has the nice side effects of simplifying the switch control path
      (a switch reset is no longer required on hwtstamping settings change)
      and the tagger data path (it no longer needs to be informed whether to
      expect meta frames or not - it always does).
      
      Fixes: 227d07a0 ("net: dsa: sja1105: Add support for traffic through standalone ports")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_sja1105: fix MAC DA patching from meta frames · 1dcf6efd
      Vladimir Oltean authored
      The SJA1105 manual says that at offset 4 into the meta frame payload we
      have "MAC destination byte 2" and at offset 5 we have "MAC destination
      byte 1". These are counted from the LSB, so byte 1 is h_dest[ETH_ALEN-2]
      aka h_dest[4] and byte 2 is h_dest[ETH_ALEN-3] aka h_dest[3].
      
      The sja1105_meta_unpack() function decodes these the other way around,
      so a frame with MAC DA 01:80:c2:11:22:33 is received by the network
      stack as having 01:80:c2:22:11:33.
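
The byte ordering described above can be sketched in plain C. This is a minimal userspace illustration, not the driver's code: the helper name patch_mac_da() and the flat meta[] payload buffer are assumptions made for the example; only the offsets (4 and 5) and the h_dest indices (3 and 4) come from the text.

```c
#include <stdint.h>

#define ETH_ALEN 6 /* length of an Ethernet MAC address in bytes */

/*
 * Hypothetical helper: restore the two MAC DA bytes that the switch
 * overwrote, using the copies carried in the meta frame payload.
 * "Byte N" is counted from the least significant byte of the address,
 * so byte 0 is h_dest[5], byte 1 is h_dest[4] and byte 2 is h_dest[3].
 */
static void patch_mac_da(uint8_t h_dest[ETH_ALEN], const uint8_t *meta)
{
	h_dest[3] = meta[4]; /* offset 4: "MAC destination byte 2" */
	h_dest[4] = meta[5]; /* offset 5: "MAC destination byte 1" */
}
```

Swapping the two assignments reproduces the symptom from the commit message: 01:80:c2:11:22:33 would come out as 01:80:c2:22:11:33.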
      
      Fixes: e53e18a6 ("net: dsa: sja1105: Receive and decode meta frames")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Replace strlcpy with strscpy · ba7bdec3
      Azeem Shaikh authored
      strlcpy() reads the entire source buffer first.
      This read may exceed the destination size limit.
      This is both inefficient and can lead to linear read
      overflows if a source string is not NUL-terminated [1].
      In an effort to remove strlcpy() completely [2], replace
      strlcpy() here with strscpy().
      No return values were used, so direct replacement is safe.
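
To illustrate the difference in behaviour, here is a rough userspace approximation of the strscpy() contract. The name bounded_copy() is hypothetical and this is not the kernel implementation; it only models the documented properties: never read more than size bytes of the source, always NUL-terminate when size is non-zero, and report truncation with -E2BIG.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical userspace model of strscpy() semantics.
 * Returns the number of characters copied (excluding the NUL),
 * or -E2BIG if the source did not fit or size is zero.
 */
static ssize_t bounded_copy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;

	/* Stop at size - 1 so there is always room for the terminator. */
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If the source continues past what was copied, it was truncated. */
	return src[i] == '\0' ? (ssize_t)i : -E2BIG;
}
```

Unlike strlcpy(), which has to walk the whole source to compute its return value, this loop stops as soon as the destination is full.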
      
      [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
      [2] https://github.com/KSPP/linux/issues/89

      Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pptp: Fix fib lookup calls. · 84bef5b6
      Guillaume Nault authored
      PPTP uses pppox sockets (struct pppox_sock). These sockets don't embed
      an inet_sock structure, so it's invalid to call inet_sk() on them.
      
      Therefore, the ip_route_output_ports() call in pptp_connect() has two
      problems:
      
        * The tos variable is set with RT_CONN_FLAGS(sk), which calls
          inet_sk() on the pppox socket.
      
        * ip_route_output_ports() tries to retrieve routing flags using
          inet_sk_flowi_flags(), which is also going to call inet_sk() on the
          pppox socket.
      
      While PPTP doesn't use inet sockets, it's actually really layered on
      top of IP and therefore needs a proper way to do fib lookups. So let's
      define pptp_route_output() to get a struct rtable from a pptp socket.
      Let's also replace the ip_route_output_ports() call of pptp_xmit() for
      consistency.
      
      In practice, this means that:
      
        * pptp_connect() sets ->flowi4_tos and ->flowi4_flags to zero instead
          of using bits of unrelated struct pppox_sock fields.
      
        * pptp_xmit() now respects ->sk_mark and ->sk_uid.
      
        * pptp_xmit() now calls the security_sk_classify_flow() security
          hook, thus allowing ->flowic_secid to be set.
      
        * pptp_xmit() now passes the pppox socket to xfrm_lookup_route().
      
      Found by code inspection.
      
      Fixes: 00959ade ("PPTP: PPP over IPv4 (Point-to-Point Tunneling Protocol)")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Fix an IS_ERR() vs NULL check · 90a8007b
      Dan Carpenter authored
      The mlxsw_sp_crif_alloc() function returns NULL on error.  It doesn't
      return error pointers.  Fix the check.
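
Why the wrong check is harmful can be shown with a small userspace model. Everything here is an illustrative assumption (is_err(), struct crif, crif_alloc(), crif_create()); only the idea is taken from the commit message: an IS_ERR()-style test is false for NULL, so a NULL-returning allocator must be checked with a plain NULL test.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/*
 * Userspace stand-in for the kernel's IS_ERR(): true only for pointers
 * in the top MAX_ERRNO addresses, which encode negative errno values.
 * NULL is not in that range.
 */
static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct crif { int key; }; /* hypothetical stand-in type */

/*
 * Hypothetical allocator that, like mlxsw_sp_crif_alloc(), reports
 * failure by returning NULL rather than an error pointer.
 */
static struct crif *crif_alloc(int simulate_failure)
{
	if (simulate_failure)
		return NULL;
	return calloc(1, sizeof(struct crif));
}

/* Returns 0 on success, -1 on allocation failure. */
static int crif_create(int simulate_failure, struct crif **out)
{
	struct crif *crif = crif_alloc(simulate_failure);

	if (!crif) /* correct: an is_err() test would let NULL through */
		return -1;
	*out = crif;
	return 0;
}
```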
      
      Fixes: 78126cfd ("mlxsw: spectrum_router: Maintain CRIF for fallback loopback RIF")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX · 30c45b53
      Lin Ma authored
      The attribute TCA_PEDIT_PARMS_EX is not included in pedit_policy, so
      a malicious user could craft a TCA_PEDIT_PARMS_EX whose length is
      smaller than the intended sizeof(struct tc_pedit). Hence, the
      dereference in tcf_pedit_init() could access dirty heap data.
      
      static int tcf_pedit_init(...)
      {
        // ...
        pattr = tb[TCA_PEDIT_PARMS]; // TCA_PEDIT_PARMS is included
        if (!pattr)
          pattr = tb[TCA_PEDIT_PARMS_EX]; // but this is not
      
        // ...
        parm = nla_data(pattr);
      
        index = parm->index; // parm is able to be smaller than 4 bytes
                             // and this dereference gets dirty skb_buff
                             // data created in netlink_sendmsg
      }
      
      This commit adds a TCA_PEDIT_PARMS_EX length to pedit_policy, which
      avoids the above case, just as is already done for TCA_PEDIT_PARMS.
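
The kind of check that a policy length entry enforces can be sketched as follows. This is a simplified userspace model, not the netlink API: struct attr_hdr, ATTR_HDRLEN and attr_payload_ok() are names invented for the example, chosen to resemble a 4-byte attribute header followed by a payload.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ATTR_HDRLEN 4u /* assumed size of the attribute header below */

/* Simplified attribute header: total length (header included) and type. */
struct attr_hdr {
	uint16_t len;
	uint16_t type;
};

/*
 * Accept the attribute only if its payload is at least min_len bytes,
 * so that a later cast of the payload to a struct of that size cannot
 * read past what the sender actually supplied.
 */
static bool attr_payload_ok(const struct attr_hdr *attr, size_t min_len)
{
	if (attr->len < ATTR_HDRLEN)
		return false;
	return (size_t)(attr->len - ATTR_HDRLEN) >= min_len;
}
```

With a check like this in place, an attribute carrying only a couple of payload bytes is rejected before anything dereferences it as a larger structure.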
      
      Fixes: 71d0ed70 ("net/act_pedit: Support using offset relative to the conventional network headers")
      Signed-off-by: Lin Ma <linma@zju.edu.cn>
      Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
      Link: https://lore.kernel.org/r/20230703110842.590282-1-linma@zju.edu.cn
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  2. 03 Jul, 2023 12 commits
    • ptp: Make max_phase_adjustment sysfs device attribute invisible when not supported · 2c5d234d
      Rahul Rameshbabu authored
      The .adjphase operation is implemented only by certain PHCs. The sysfs
      device attribute node for querying the maximum supported phase
      adjustment should not be exposed on devices that do not support
      .adjphase.
      
      Fixes: c3b60ab7 ("ptp: Add .getmaxphase callback to ptp_clock_info")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Link: https://lore.kernel.org/netdev/20230627162146.GA114473@dev-arch.thelio-3990X/
      Link: https://lore.kernel.org/all/CA+G9fYtKCZeAUTtwe69iK8Xcz1mOKQzwcy49wd+imZrfj6ifXA@mail.gmail.com/
      Tested-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Reviewed-by: Petr Vorel <pvorel@suse.cz>
      Message-ID: <20230627232139.213130-1-rrameshbabu@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Documentation: ABI: sysfs-class-net-qmi: pass_through contact update · acd97558
      Subash Abhinov Kasiviswanathan authored
      Switch to the quicinc.com id.
      
      Fixes: bd1af6b5 ("Documentation: ABI: sysfs-class-net-qmi: document pass-through file")
      Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: annotate data races in __tcp_oow_rate_limited() · 998127cd
      Eric Dumazet authored
      Request sockets are lockless, so __tcp_oow_rate_limited() could be
      called on the same object from different CPUs. This is harmless.
      
      Add READ_ONCE()/WRITE_ONCE() annotations to avoid a KCSAN report.
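
The annotation pattern can be sketched in userspace C. The macros below are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (they rely on the GNU __typeof__ extension), and oow_rate_limited() with its now/min_interval parameters is a hypothetical reduction of the rate-limit logic, not the kernel function itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins: force a single, untorn access via volatile. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/*
 * Hypothetical reduction of an out-of-window ACK rate limiter.
 * The shared timestamp is read once into a local and written once,
 * so concurrent callers may race but each access is well defined.
 */
static bool oow_rate_limited(uint32_t *last_time, uint32_t now,
			     uint32_t min_interval)
{
	uint32_t last = READ_ONCE(*last_time);

	if (last != 0) {
		int32_t elapsed = (int32_t)(now - last);

		if (elapsed >= 0 && (uint32_t)elapsed < min_interval)
			return true; /* too soon since the last one */
	}

	WRITE_ONCE(*last_time, now);
	return false;
}
```

The race itself stays: two CPUs can still both decide "not limited". The annotations only document that this is intentional and keep the compiler from splitting or repeating the accesses.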
      
      Fixes: 4ce7e93c ("tcp: rate limit ACK sent by SYN_RECV request sockets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'wireguard-fixes' · c94683ed
      David S. Miller authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 6.4.2/6.5-rc1
      
      Sorry to send these patches during the merge window, but they're net
      fixes, not netdev enhancements, and while I'd ordinarily wait anyway,
      I just got a first bug report for one of these fixes, which I originally
      had thought was mostly unlikely. So please apply the following three
      patches to net:
      
      1) Make proper use of nr_cpu_ids with cpumask_next(), rather than
         awkwardly using modulo, to handle dynamic CPU topology changes.
         Linus noticed this a while ago and pointed it out, and today a user
         actually got hit by it.
      
      2) Respect persistent keepalive and other staged packets when setting
         the private key after the interface is already up.
      
      3) Use timer_delete_sync() instead of del_timer_sync(), per the
         documentation.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: timers: move to using timer_delete_sync · 326534e8
      Jason A. Donenfeld authored
      The documentation says that del_timer_sync is obsolete, and code should
      use the equivalent timer_delete_sync instead, so switch to it.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: netlink: send staged packets when setting initial private key · f58d0a9b
      Jason A. Donenfeld authored
      Packets bound for peers can queue up prior to the device private key
      being set. For example, if persistent keepalive is set, a packet is
      queued up to be sent as soon as the device comes up. However, if the
      private key hasn't been set yet, the handshake message never sends, and
      no timer is armed to retry, since that would be pointless.
      
      But, if a user later sets a private key, the expectation is that those
      queued packets, such as a persistent keepalive, are actually sent. So
      adjust the configuration logic to account for this edge case, and add a
      test case to make sure this works.
      
      Maxim noticed this with a wg-quick(8) config to the tune of:
      
          [Interface]
          PostUp = wg set %i private-key somefile
      
          [Peer]
          PublicKey = ...
          Endpoint = ...
          PersistentKeepalive = 25
      
      Here, the private key gets set after the device comes up using a PostUp
      script, triggering the bug.
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Reported-by: Maxim Cournoyer <maxim.cournoyer@gmail.com>
      Tested-by: Maxim Cournoyer <maxim.cournoyer@gmail.com>
      Link: https://lore.kernel.org/wireguard/87fs7xtqrv.fsf@gmail.com/
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wireguard: queueing: use saner cpu selection wrapping · 7387943f
      Jason A. Donenfeld authored
      Using `% nr_cpumask_bits` is slow and complicated, and not totally
      robust toward dynamic changes to CPU topologies. Rather than storing the
      next CPU in the round-robin, just store the last one, and also return
      that value. This simplifies the loop drastically into a much more common
      pattern.
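
The "store the last CPU and wrap" pattern can be illustrated with a userspace model in which the online CPU mask is a plain 64-bit bitmap. NR_CPUS, mask_next() and next_online() are assumptions for this sketch and only loosely mirror the kernel's cpumask helpers.

```c
#include <stdint.h>

#define NR_CPUS 8 /* assumed upper bound on CPU ids for this sketch */

/* First set bit strictly after n, or NR_CPUS if there is none. */
static int mask_next(int n, uint64_t mask)
{
	for (int cpu = n + 1; cpu < NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return NR_CPUS;
}

/*
 * Round-robin over online CPUs: advance from the last CPU used and
 * wrap to the first online CPU when the end of the mask is reached.
 * No modulo is involved. Returns NR_CPUS only if the mask is empty.
 */
static int next_online(int *last_cpu, uint64_t online)
{
	int cpu = mask_next(*last_cpu, online);

	if (cpu >= NR_CPUS)
		cpu = mask_next(-1, online); /* wrap around */
	*last_cpu = cpu;
	return cpu;
}
```

Because the stored value is always a CPU that was actually selected, a change in the online mask between calls cannot leave the cursor pointing at a precomputed CPU that no longer exists.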
      
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Cc: stable@vger.kernel.org
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Tested-by: Manuel Leiner <manuel.leiner@gmx.de>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples: pktgen: fix append mode failed issue · a27ac539
      J.J. Martzki authored
      Each sample script sources functions.sh before parameters.sh, so
      $APPEND is still undefined when EXIT is trapped, whether or not
      append mode is requested. As a result, the sample scripts always run
      "pgctrl reset" when they finish, which resets the pktgen config.
      
      So move the trap into each script, after parameters.sh is sourced,
      and trap EXIT explicitly.
      Signed-off-by: J.J. Martzki <mars14850@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests/net: Add xt_policy config for xfrm_policy test · f56d1eea
      Daniel Díaz authored
      When running Kselftests with the current selftests/net/config
      the following problem can be seen with the net:xfrm_policy.sh
      selftest:
      
        # selftests: net: xfrm_policy.sh
        [   41.076721] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
        [   41.094787] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
        [   41.107635] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
        # modprobe: FATAL: Module ip_tables not found in directory /lib/modules/6.1.36
        # iptables v1.8.7 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
        # Perhaps iptables or your kernel needs to be upgraded.
        # modprobe: FATAL: Module ip_tables not found in directory /lib/modules/6.1.36
        # iptables v1.8.7 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
        # Perhaps iptables or your kernel needs to be upgraded.
        # SKIP: Could not insert iptables rule
        ok 1 selftests: net: xfrm_policy.sh # SKIP
      
      This is because IPsec "policy" match support is not available
      to the kernel.
      
      This patch adds CONFIG_NETFILTER_XT_MATCH_POLICY as a module
      to the selftests/net/config file, so that `make
      kselftest-merge` can take this into consideration.
      Signed-off-by: Daniel Díaz <daniel.diaz@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix net_dev_start_xmit trace event vs skb_transport_offset() · f88fcb1d
      Eric Dumazet authored
      After the blamed commit, we must be more careful about using
      skb_transport_offset(), as syzbot reminded us:
      
      WARNING: CPU: 0 PID: 10 at include/linux/skbuff.h:2868 skb_transport_offset include/linux/skbuff.h:2977 [inline]
      WARNING: CPU: 0 PID: 10 at include/linux/skbuff.h:2868 perf_trace_net_dev_start_xmit+0x89a/0xce0 include/trace/events/net.h:14
      Modules linked in:
      CPU: 0 PID: 10 Comm: kworker/u4:1 Not tainted 6.1.30-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
      Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
      RIP: 0010:skb_transport_header include/linux/skbuff.h:2868 [inline]
      RIP: 0010:skb_transport_offset include/linux/skbuff.h:2977 [inline]
      RIP: 0010:perf_trace_net_dev_start_xmit+0x89a/0xce0 include/trace/events/net.h:14
      Code: 8b 04 25 28 00 00 00 48 3b 84 24 c0 00 00 00 0f 85 4e 04 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc e8 56 22 01 fd <0f> 0b e9 f6 fc ff ff 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 86 f9 ff
      RSP: 0018:ffffc900002bf700 EFLAGS: 00010293
      RAX: ffffffff8485d8ca RBX: 000000000000ffff RCX: ffff888100914280
      RDX: 0000000000000000 RSI: 000000000000ffff RDI: 000000000000ffff
      RBP: ffffc900002bf818 R08: ffffffff8485d5b6 R09: fffffbfff0f8fb5e
      R10: 0000000000000000 R11: dffffc0000000001 R12: 1ffff110217d8f67
      R13: ffff88810bec7b3a R14: dffffc0000000000 R15: dffffc0000000000
      FS: 0000000000000000(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f96cf6d52f0 CR3: 000000012224c000 CR4: 0000000000350ef0
      Call Trace:
      <TASK>
      [<ffffffff84715e35>] trace_net_dev_start_xmit include/trace/events/net.h:14 [inline]
      [<ffffffff84715e35>] xmit_one net/core/dev.c:3643 [inline]
      [<ffffffff84715e35>] dev_hard_start_xmit+0x705/0x980 net/core/dev.c:3660
      [<ffffffff8471a232>] __dev_queue_xmit+0x16b2/0x3370 net/core/dev.c:4324
      [<ffffffff85416493>] dev_queue_xmit include/linux/netdevice.h:3030 [inline]
      [<ffffffff85416493>] batadv_send_skb_packet+0x3f3/0x680 net/batman-adv/send.c:108
      [<ffffffff85416744>] batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
      [<ffffffff853bc52a>] batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline]
      [<ffffffff853bc52a>] batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline]
      [<ffffffff853bc52a>] batadv_iv_send_outstanding_bat_ogm_packet+0x69a/0x840 net/batman-adv/bat_iv_ogm.c:1701
      [<ffffffff8151023c>] process_one_work+0x8ac/0x1170 kernel/workqueue.c:2289
      [<ffffffff81511938>] worker_thread+0xaa8/0x12d0 kernel/workqueue.c:2436
      
      Fixes: 66e4c8d9 ("net: warn if transport header was not set")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_sja1105: fix source port decoding in vlan_filtering=0 bridge mode · a398b9ea
      Vladimir Oltean authored
      There was a regression introduced by the blamed commit, where pinging to
      a VLAN-unaware bridge would fail with the repeated message "Couldn't
      decode source port" coming from the tagging protocol driver.
      
      When receiving packets with a bridge_vid as determined by
      dsa_tag_8021q_bridge_join(), dsa_8021q_rcv() will decode:
      - source_port = 0 (which isn't really valid, more like "don't know")
      - switch_id = 0 (which isn't really valid, more like "don't know")
      - vbid = value in range 1-7
      
      Since the blamed patch has reversed the order of the checks, we are now
      going to believe that source_port != -1 and switch_id != -1, so they're
      valid, but they aren't.
      
      The minimal solution to the problem is to only populate source_port and
      switch_id with what dsa_8021q_rcv() came up with, if the vbid is zero,
      i.e. the source port information is trustworthy.
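
The rule in the last paragraph can be expressed as a small sketch. The struct and function names (decoded_tag, apply_decoded_source()) are hypothetical and the real tagger code is structured differently; the sketch only shows the condition "trust the decoded source port and switch ID only when vbid is zero".

```c
/* Hypothetical result of decoding a tag_8021q VLAN ID. */
struct decoded_tag {
	int source_port; /* 0 when the VLAN only identifies a bridge */
	int switch_id;   /* 0 when the VLAN only identifies a bridge */
	int vbid;        /* 1-7 for a bridge VLAN, 0 otherwise */
};

/*
 * Copy the decoded source information only when it is trustworthy.
 * With a non-zero vbid, the zeros in source_port/switch_id mean
 * "don't know", so the caller's values (-1 = unknown) are left alone.
 */
static void apply_decoded_source(const struct decoded_tag *tag,
				 int *source_port, int *switch_id)
{
	if (tag->vbid != 0)
		return;

	*source_port = tag->source_port;
	*switch_id = tag->switch_id;
}
```

Leaving the values at -1 for bridge VLANs lets later code keep treating them as unknown instead of mistaking port 0 on switch 0 for a valid answer.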
      
      Fixes: c1ae02d8 ("net: dsa: tag_sja1105: always prefer source port information from INCL_SRCPT")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC mode · 6ca3c005
      Vladimir Oltean authored
      According to the synchronization rules for .ndo_get_stats() as seen in
      Documentation/networking/netdevices.rst, acquiring a plain spin_lock()
      should not be illegal, but the bridge driver implementation makes it so.
      
      After running these commands, I am faced with the following
      lockdep splat:
      
      $ ip link add link swp0 name macsec0 type macsec encrypt on && ip link set swp0 up
      $ ip link add dev br0 type bridge vlan_filtering 1 && ip link set br0 up
      $ ip link set macsec0 master br0 && ip link set macsec0 up
      
        ========================================================
        WARNING: possible irq lock inversion dependency detected
        6.4.0-04295-g31b577b4bd4a #603 Not tainted
        --------------------------------------------------------
        swapper/1/0 just changed the state of lock:
        ffff6bd348724cd8 (&br->lock){+.-.}-{3:3}, at: br_forward_delay_timer_expired+0x34/0x198
        but this lock took another, SOFTIRQ-unsafe lock in the past:
         (&ocelot->stats_lock){+.+.}-{3:3}
      
        and interrupts could create inverse lock ordering between them.
      
        other info that might help us debug this:
        Chain exists of:
          &br->lock --> &br->hash_lock --> &ocelot->stats_lock
      
         Possible interrupt unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&ocelot->stats_lock);
                                       local_irq_disable();
                                       lock(&br->lock);
                                       lock(&br->hash_lock);
          <Interrupt>
            lock(&br->lock);
      
         *** DEADLOCK ***
      
      (details about the 3 locks skipped)
      
      swp0 is instantiated by drivers/net/dsa/ocelot/felix.c, and this
      only matters to the extent that its .ndo_get_stats64() method calls
      spin_lock(&ocelot->stats_lock).
      
      Documentation/locking/lockdep-design.rst says:
      
      | A lock is irq-safe means it was ever used in an irq context, while a lock
      | is irq-unsafe means it was ever acquired with irq enabled.
      
      (...)
      
      | Furthermore, the following usage based lock dependencies are not allowed
      | between any two lock-classes::
      |
      |    <hardirq-safe>   ->  <hardirq-unsafe>
      |    <softirq-safe>   ->  <softirq-unsafe>
      
      Lockdep marks br->hash_lock as softirq-safe, because it is sometimes
      taken in softirq context (for example br_fdb_update() which runs in
      NET_RX softirq), and when it's not in softirq context it blocks softirqs
      by using spin_lock_bh().
      
      Lockdep marks ocelot->stats_lock as softirq-unsafe, because it never
      blocks softirqs from running, and it is never taken from softirq
      context. So it can always be interrupted by softirqs.
      
      There is a call path through which a function that holds br->hash_lock:
      fdb_add_hw_addr() will call a function that acquires ocelot->stats_lock:
      ocelot_port_get_stats64(). This can be seen below:
      
      ocelot_port_get_stats64+0x3c/0x1e0
      felix_get_stats64+0x20/0x38
      dsa_slave_get_stats64+0x3c/0x60
      dev_get_stats+0x74/0x2c8
      rtnl_fill_stats+0x4c/0x150
      rtnl_fill_ifinfo+0x5cc/0x7b8
      rtmsg_ifinfo_build_skb+0xe4/0x150
      rtmsg_ifinfo+0x5c/0xb0
      __dev_notify_flags+0x58/0x200
      __dev_set_promiscuity+0xa0/0x1f8
      dev_set_promiscuity+0x30/0x70
      macsec_dev_change_rx_flags+0x68/0x88
      __dev_set_promiscuity+0x1a8/0x1f8
      __dev_set_rx_mode+0x74/0xa8
      dev_uc_add+0x74/0xa0
      fdb_add_hw_addr+0x68/0xd8
      fdb_add_local+0xc4/0x110
      br_fdb_add_local+0x54/0x88
      br_add_if+0x338/0x4a0
      br_add_slave+0x20/0x38
      do_setlink+0x3a4/0xcb8
      rtnl_newlink+0x758/0x9d0
      rtnetlink_rcv_msg+0x2f0/0x550
      netlink_rcv_skb+0x128/0x148
      rtnetlink_rcv+0x24/0x38
      
      The plain English explanation for it is:
      
      The macsec0 bridge port is created without p->flags & BR_PROMISC,
      because it is what br_manage_promisc() decides for a VLAN filtering
      bridge with a single auto port.
      
      As part of the br_add_if() procedure, br_fdb_add_local() is called for
      the MAC address of the device, and this results in a call to
      dev_uc_add() for macsec0 while the softirq-safe br->hash_lock is taken.
      
      Because macsec0 does not have IFF_UNICAST_FLT, dev_uc_add() ends up
      calling __dev_set_promiscuity() for macsec0, which is propagated by its
      implementation, macsec_dev_change_rx_flags(), to the lower device: swp0.
      This triggers the call path:
      
      dev_set_promiscuity(swp0)
      -> rtmsg_ifinfo()
         -> dev_get_stats()
            -> ocelot_port_get_stats64()
      
      with a calling context that lockdep doesn't like (br->hash_lock held).
      
      Normally we don't see this: even though many drivers that can be
      bridge ports don't support IFF_UNICAST_FLT, triggering it takes a
      driver that
      
      (a) doesn't support IFF_UNICAST_FLT, *and*
      (b) forwards the IFF_PROMISC flag to another driver, *and*
      (c) *that* driver implements ndo_get_stats64() using a softirq-unsafe
          spinlock.
      
      Condition (b) is necessary because the first __dev_set_rx_mode() calls
      __dev_set_promiscuity() with "bool notify=false", and thus, the
      rtmsg_ifinfo() code path won't be entered.
      
      The same criteria also hold true for DSA switches which don't report
      IFF_UNICAST_FLT. When the DSA master uses a spin_lock() in its
      ndo_get_stats64() method, the same lockdep splat can be seen.
      
      I think the deadlock possibility is real, even though I didn't reproduce
      it, and I'm thinking of the following situation to support that claim:
      
      fdb_add_hw_addr() runs on a CPU A, in a context with softirqs locally
      disabled and br->hash_lock held, and may end up attempting to acquire
      ocelot->stats_lock.
      
      In parallel, ocelot->stats_lock is currently held by a thread B (say,
      ocelot_check_stats_work()), which is interrupted while holding it by a
      softirq which attempts to lock br->hash_lock.
      
      Thread B cannot make progress because br->hash_lock is held by A. Whereas
      thread A cannot make progress because ocelot->stats_lock is held by B.
      
      When taking the issue at face value, the bridge can avoid that problem
      by simply making the ports promiscuous from a code path with a saner
      calling context (br->hash_lock not held). A bridge port without
      IFF_UNICAST_FLT is going to become promiscuous as soon as we call
      dev_uc_add() on it (which we do unconditionally), so why not be
      preemptive and make it promiscuous right from the beginning, so as
      not to be taken by surprise?
      
      With this, we've broken the links between code that holds br->hash_lock
      or br->lock and code that calls into the ndo_change_rx_flags() or
      ndo_get_stats64() ops of the bridge port.
      
      Fixes: 2796d0c6 ("bridge: Automatically manage port promiscuous mode.")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 02 Jul, 2023 6 commits
  4. 01 Jul, 2023 3 commits
  5. 30 Jun, 2023 2 commits
  6. 29 Jun, 2023 10 commits