1. 25 Apr, 2024 33 commits
    • dpll: fix dpll_pin_on_pin_register() for multiple parent pins · 38d7b94e
      Arkadiusz Kubalewski authored
      When a pin is registered with multiple parent pins via
      dpll_pin_on_pin_register(..), all belonging to the same dpll device,
      a second call to dpll_pin_on_pin_unregister(..) causes a call trace,
      as it tries to use already released registration resources (due to the
      fix introduced in b446631f). In this scenario the pin was registered
      twice, so the resources are not expected to be released until each
      registered pin/pin pair is unregistered.
      
      Currently, the following crash/call trace is produced when ice driver is
      removed on the system with installed E810T NIC which includes dpll device:
      
      WARNING: CPU: 51 PID: 9155 at drivers/dpll/dpll_core.c:809 dpll_pin_ops+0x20/0x30
      RIP: 0010:dpll_pin_ops+0x20/0x30
      Call Trace:
       ? __warn+0x7f/0x130
       ? dpll_pin_ops+0x20/0x30
       dpll_msg_add_pin_freq+0x37/0x1d0
       dpll_cmd_pin_get_one+0x1c0/0x400
       ? __nlmsg_put+0x63/0x80
       dpll_pin_event_send+0x93/0x140
       dpll_pin_on_pin_unregister+0x3f/0x100
       ice_dpll_deinit_pins+0xa1/0x230 [ice]
       ice_remove+0xf1/0x210 [ice]
      
      Fix by adding a parent pointer as a cookie when creating a registration,
      and also when searching for it. For regular pins pass NULL. This allows
      creating a separate registration for each parent the pin is registered with.
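
      The cookie idea can be sketched in Python as a toy model. This is not the
      kernel code: the class name, the set-based storage and the string
      stand-ins for ops/priv/parent are all assumptions made for illustration.

```python
class PinRegistrations:
    """Toy model: one registration per (ops, priv, cookie) key."""

    def __init__(self):
        self._regs = set()

    def register(self, ops, priv, cookie=None):
        # cookie stands in for the parent pin; None models a regular pin.
        key = (ops, priv, cookie)
        if key in self._regs:
            raise ValueError("registration already exists")
        self._regs.add(key)

    def unregister(self, ops, priv, cookie=None):
        # Only the registration matching this parent is removed.
        self._regs.remove((ops, priv, cookie))
        return len(self._regs)  # registrations still alive


regs = PinRegistrations()
regs.register("ops", "priv", cookie="parent_a")
regs.register("ops", "priv", cookie="parent_b")
remaining_after_first = regs.unregister("ops", "priv", cookie="parent_a")
remaining_after_second = regs.unregister("ops", "priv", cookie="parent_b")
```

      With the parent in the key, the first unregister leaves the second
      registration intact instead of releasing shared resources early.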
      
      Fixes: b446631f ("dpll: fix dpll_xa_ref_*_del() for multiple registrations")
      Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20240424101636.1491424-1-arkadiusz.kubalewski@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ravb: Fix registered interrupt names · 0c81ea5a
      Geert Uytterhoeven authored
      As interrupts are now requested from ravb_probe(), before calling
      register_netdev(), ndev->name still contains the template "eth%d",
      leading to funny names in /proc/interrupts.  E.g. on R-Car E3:
      
      	89:  0      0  GICv2  93 Level  eth%d:ch22:multi
      	90:  0      3  GICv2  95 Level  eth%d:ch24:emac
      	91:  0  23484  GICv2  71 Level  eth%d:ch0:rx_be
      	92:  0      0  GICv2  72 Level  eth%d:ch1:rx_nc
      	93:  0  13735  GICv2  89 Level  eth%d:ch18:tx_be
      	94:  0      0  GICv2  90 Level  eth%d:ch19:tx_nc
      
      Worse, on platforms with multiple RAVB instances (e.g. R-Car V4H), all
      interrupts have similar names.
      
      Fix this by using the device name instead, as is done in several other
      drivers:
      
      	89:  0      0  GICv2  93 Level  e6800000.ethernet:ch22:multi
      	90:  0      1  GICv2  95 Level  e6800000.ethernet:ch24:emac
      	91:  0  28578  GICv2  71 Level  e6800000.ethernet:ch0:rx_be
      	92:  0      0  GICv2  72 Level  e6800000.ethernet:ch1:rx_nc
      	93:  0  14044  GICv2  89 Level  e6800000.ethernet:ch18:tx_be
      	94:  0      0  GICv2  90 Level  e6800000.ethernet:ch19:tx_nc
      
      Rename the local variable dev_name, as it shadows the dev_name()
      function, and pre-initialize it, to simplify the code.
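
      A small Python illustration of why the names looked wrong. The helper
      irq_name() is hypothetical and only mimics the base:channel:suffix
      layout visible in the /proc/interrupts output above.

```python
def irq_name(base: str, channel: str, suffix: str) -> str:
    """Build an interrupt name of the form base:channel:suffix."""
    return f"{base}:{channel}:{suffix}"


# Before register_netdev() the netdev name is still the unexpanded template.
template_name = "eth%d"
# Device name as shown in the fixed example output above.
device_name = "e6800000.ethernet"

before = irq_name(template_name, "ch22", "multi")
after = irq_name(device_name, "ch22", "multi")
```

      The template-based name is identical for every instance, while the
      device name is unique per device.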
      
      Fixes: 32f012b8 ("net: ravb: Move getting/requesting IRQs in the probe() method")
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
      Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
      Reviewed-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
      Tested-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> # on RZ/G3S
      Link: https://lore.kernel.org/r/cde67b68adf115b3cf0b44c32334ae00b2fbb321.1713944647.git.geert+renesas@glider.be
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • octeontx2-af: fix the double free in rvu_npc_freemem() · 6e965eba
      Su Hui authored
      Clang static checker(scan-build) warning:
      drivers/net/ethernet/marvell/octeontx2/af/rvu_npc.c:line 2184, column 2
      Attempt to free released memory.
      
      npc_mcam_rsrcs_deinit() has already released 'mcam->counters.bmap'. Delete
      this redundant kfree() to fix the double free.
      
      Fixes: dd784287 ("octeontx2-af: Add new devlink param to configure maximum usable NIX block LFs")
      Signed-off-by: Su Hui <suhui@nfschina.com>
      Reviewed-by: Geetha sowjanya <gakula@marvell.com>
      Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
      Link: https://lore.kernel.org/r/20240424022724.144587-1-suhui@nfschina.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethernet: ti: am65-cpts: Fix PTPv1 message type on TX packets · 1b9e743e
      Jason Reeder authored
      The CPTS, by design, captures the messageType (Sync, Delay_Req, etc.)
      field from the second nibble of the PTP header which is defined in the
      PTPv2 (1588-2008) specification. In the PTPv1 (1588-2002) specification
      the first two bytes of the PTP header are defined as the versionType
      which is always 0x0001. This means that any PTPv1 packets that are
      tagged for TX timestamping by the CPTS will have their messageType set
      to 0x0 which corresponds to a Sync message type. This causes issues
      when a PTPv1 stack is expecting a Delay_Req (messageType: 0x1)
      timestamp that never appears.
      
      Fix this by checking if the ptp_class of the timestamped TX packet is
      PTP_CLASS_V1 and then matching the PTP sequence ID to the stored
      sequence ID in the skb->cb data structure. If the sequence IDs match
      and the packet is of type PTPv1 then there is a chance that the
      messageType has been incorrectly stored by the CPTS so overwrite the
      messageType stored by the CPTS with the messageType from the skb->cb
      data structure. This allows the PTPv1 stack to receive TX timestamps
      for Delay_Req packets which are necessary to lock onto a PTP Leader.
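
      The override described above can be sketched in Python. This is a
      simplified model, not the driver code: the HwEvent and SkbCb types, their
      field names and the constant values are assumptions for illustration, and
      the PTP class is modelled as a plain value.

```python
from dataclasses import dataclass

PTP_CLASS_V1 = 1  # illustrative stand-ins for the PTP class values
PTP_CLASS_V2 = 2


@dataclass
class HwEvent:
    message_type: int  # messageType as captured by the hardware
    seqid: int         # PTP sequence ID captured by the hardware


@dataclass
class SkbCb:
    ptp_class: int     # class of the packet queued for TX timestamping
    message_type: int  # messageType parsed in software at TX time
    seqid: int         # sequence ID parsed in software at TX time


def fixup_v1_message_type(event: HwEvent, cb: SkbCb) -> HwEvent:
    """Trust the software-parsed messageType for matching PTPv1 packets."""
    if cb.ptp_class == PTP_CLASS_V1 and event.seqid == cb.seqid:
        event.message_type = cb.message_type
    return event


# A PTPv1 Delay_Req (0x1) that the hardware reported as Sync (0x0).
fixed = fixup_v1_message_type(HwEvent(0x0, 42), SkbCb(PTP_CLASS_V1, 0x1, 42))
```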

      Signed-off-by: Jason Reeder <jreeder@ti.com>
      Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
      Tested-by: Ed Trexel <ed.trexel@hp.com>
      Fixes: f6bd5952 ("net: ethernet: ti: introduce am654 common platform time sync driver")
      Link: https://lore.kernel.org/r/20240424071626.32558-1-r-gunasekaran@ti.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'intel-wired-lan-driver-updates-2024-04-23-i40e-iavf-ice' · 179d5166
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2024-04-23 (i40e, iavf, ice)
      
      This series contains updates to i40e, iavf, and ice drivers.
      
      Sindhu removes WQ_MEM_RECLAIM flag from workqueue for i40e.
      
      Erwan Velu adjusts message to avoid confusion on base being reported on
      i40e.
      
      Sudheer corrects insufficient check for TC equality on iavf.
      
      Jake corrects ordering of locks to avoid possible deadlock on ice.
      ====================
      
      Link: https://lore.kernel.org/r/20240423182723.740401-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'fix-isolation-of-broadcast-traffic-and-unmatched-unicast-traffic-with-macsec-offload' · 4334496e
      Jakub Kicinski authored
      Rahul Rameshbabu says:
      
      ====================
      Fix isolation of broadcast traffic and unmatched unicast traffic with MACsec offload
      
      Some device drivers support devices that enable them to annotate whether a
      Rx skb refers to a packet that was processed by the MACsec offloading
      functionality of the device. Logic in the Rx handling for MACsec offload
      does not utilize this information to preemptively avoid forwarding to the
      macsec netdev currently. Because of this, things like multicast messages or
      unicast messages with an unmatched destination address such as ARP requests
      are forwarded to the macsec netdev whether the message received was MACsec
      encrypted or not. The goal of this patch series is to improve the Rx
      handling for MACsec offload for devices capable of annotating skbs received
      that were decrypted by the NIC offload for MACsec.
      
      Here is a summary of the issue that occurs with the existing logic today.
      
          * The current design of the MACsec offload handling path tries to use
            "best guess" mechanisms for determining whether a packet associated
            with the currently handled skb in the datapath was processed via HW
            offload
          * The best guess mechanism uses the following heuristic logic (in order of
            precedence)
            - Check if header destination MAC address matches MACsec netdev MAC
              address -> forward to MACsec port
            - Check if packet is multicast traffic -> forward to MACsec port
            - MACsec security channel was able to be looked up from skb offload
              context (mlx5 only) -> forward to MACsec port
          * Problem: plaintext traffic can potentially solicit a MACsec encrypted
            response from the offload device
            - Core aspect of MACsec is that it identifies unauthorized LAN connections
              and excludes them from communication
              + This behavior can be seen when not enabling offload for MACsec
            - The offload behavior violates this principle in MACsec
      
      I believe this behavior is a security bug since applications utilizing
      MACsec could be exploited using this behavior, and the correct way to
      resolve this is by having the hardware correctly indicate whether MACsec
      offload occurred for the packet or not. In the patches in this series, I
      leave a warning for when the problematic path occurs because I cannot
      figure out a secure way to fix the security issue that applies to the core
      MACsec offload handling in the Rx path without breaking MACsec offload for
      other vendors.
      
      Shown at the bottom is an example use case where plaintext traffic sent to
      a physical port of a NIC configured for MACsec offload is unable to be
      handled correctly by the software stack when the NIC provides awareness to
      the kernel about whether the received packet is MACsec traffic or not. In
      this specific example, plaintext ARP requests are answered with MACsec
      encrypted ARP replies (so the requester is unable to build its routing
      information).
      
          Side 1
      
            ip link del macsec0
            ip address flush mlx5_1
            ip address add 1.1.1.1/24 dev mlx5_1
            ip link set dev mlx5_1 up
            ip link add link mlx5_1 macsec0 type macsec sci 1 encrypt on
            ip link set dev macsec0 address 00:11:22:33:44:66
            ip macsec offload macsec0 mac
            ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
            ip macsec add macsec0 rx sci 2 on
            ip macsec add macsec0 rx sci 2 sa 0 pn 1 on key 00 ead3664f508eb06c40ac7104cdae4ce5
            ip address flush macsec0
            ip address add 2.2.2.1/24 dev macsec0
            ip link set dev macsec0 up
      
            # macsec0 enters promiscuous mode.
            # This enables all traffic received on macsec_vlan to be processed by
            # the macsec offload rx datapath. This however means that traffic
            # meant to be received by mlx5_1 will be incorrectly steered to
            # macsec0 as well.
      
            ip link add link macsec0 name macsec_vlan type vlan id 1
            ip link set dev macsec_vlan address 00:11:22:33:44:88
            ip address flush macsec_vlan
            ip address add 3.3.3.1/24 dev macsec_vlan
            ip link set dev macsec_vlan up
      
          Side 2
      
            ip link del macsec0
            ip address flush mlx5_1
            ip address add 1.1.1.2/24 dev mlx5_1
            ip link set dev mlx5_1 up
            ip link add link mlx5_1 macsec0 type macsec sci 2 encrypt on
            ip link set dev macsec0 address 00:11:22:33:44:77
            ip macsec offload macsec0 mac
            ip macsec add macsec0 tx sa 0 pn 1 on key 00 ead3664f508eb06c40ac7104cdae4ce5
            ip macsec add macsec0 rx sci 1 on
            ip macsec add macsec0 rx sci 1 sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
            ip address flush macsec0
            ip address add 2.2.2.2/24 dev macsec0
            ip link set dev macsec0 up
      
            # macsec0 enters promiscuous mode.
            # This enables all traffic received on macsec_vlan to be processed by
            # the macsec offload rx datapath. This however means that traffic
            # meant to be received by mlx5_1 will be incorrectly steered to
            # macsec0 as well.
      
            ip link add link macsec0 name macsec_vlan type vlan id 1
            ip link set dev macsec_vlan address 00:11:22:33:44:99
            ip address flush macsec_vlan
            ip address add 3.3.3.2/24 dev macsec_vlan
            ip link set dev macsec_vlan up
      
          Side 1
      
            ping -I mlx5_1 1.1.1.2
            PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 mlx5_1: 56(84) bytes of data.
            From 1.1.1.1 icmp_seq=1 Destination Host Unreachable
            ping: sendmsg: No route to host
            From 1.1.1.1 icmp_seq=2 Destination Host Unreachable
            From 1.1.1.1 icmp_seq=3 Destination Host Unreachable
      
      Changes:
      
        v2->v3:
          * Made dev parameter const for eth_skb_pkt_type helper as suggested by Sabrina
            Dubroca <sd@queasysnail.net>
        v1->v2:
          * Fixed series subject to detail the issue being fixed
          * Removed strange characters from cover letter
          * Added comment in example that illustrates the impact involving
            promiscuous mode
          * Added patch for generalizing packet type detection
          * Added Fixes: tags and targeting net
          * Removed pointless warning in the heuristic Rx path for macsec offload
          * Applied small refactor in Rx path offload to minimize scope of rx_sc
            local variable
      
      Link: https://github.com/Binary-Eater/macsec-rx-offload/blob/trunk/MACsec_violation_in_core_stack_offload_rx_handling.pdf
      Link: https://lore.kernel.org/netdev/20240419213033.400467-5-rrameshbabu@nvidia.com/
      Link: https://lore.kernel.org/netdev/20240419011740.333714-1-rrameshbabu@nvidia.com/
      Link: https://lore.kernel.org/netdev/87r0l25y1c.fsf@nvidia.com/
      Link: https://lore.kernel.org/netdev/20231116182900.46052-1-rrameshbabu@nvidia.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240423181319.115860-1-rrameshbabu@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: fix LAG and VF lock dependency in ice_reset_vf() · 96fdd1f6
      Jacob Keller authored
      Since commit 9f74a3df ("ice: Fix VF Reset paths when interface in a failed
      over aggregate"), the ice driver has acquired the LAG mutex in
      ice_reset_vf(). The commit placed this lock acquisition just prior to the
      acquisition of the VF configuration lock.
      
      If ice_reset_vf() acquires the configuration lock via the ICE_VF_RESET_LOCK
      flag, this could deadlock with ice_vc_cfg_qs_msg() because it always
      acquires the locks in the order of the VF configuration lock and then the
      LAG mutex.
      
      Lockdep reports this violation almost immediately on creating and then
      removing 2 VFs:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.8.0-rc6 #54 Tainted: G        W  O
      ------------------------------------------------------
      kworker/60:3/6771 is trying to acquire lock:
      ff40d43e099380a0 (&vf->cfg_lock){+.+.}-{3:3}, at: ice_reset_vf+0x22f/0x4d0 [ice]
      
      but task is already holding lock:
      ff40d43ea1961210 (&pf->lag_mutex){+.+.}-{3:3}, at: ice_reset_vf+0xb7/0x4d0 [ice]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&pf->lag_mutex){+.+.}-{3:3}:
             __lock_acquire+0x4f8/0xb40
             lock_acquire+0xd4/0x2d0
             __mutex_lock+0x9b/0xbf0
             ice_vc_cfg_qs_msg+0x45/0x690 [ice]
             ice_vc_process_vf_msg+0x4f5/0x870 [ice]
             __ice_clean_ctrlq+0x2b5/0x600 [ice]
             ice_service_task+0x2c9/0x480 [ice]
             process_one_work+0x1e9/0x4d0
             worker_thread+0x1e1/0x3d0
             kthread+0x104/0x140
             ret_from_fork+0x31/0x50
             ret_from_fork_asm+0x1b/0x30
      
      -> #0 (&vf->cfg_lock){+.+.}-{3:3}:
             check_prev_add+0xe2/0xc50
             validate_chain+0x558/0x800
             __lock_acquire+0x4f8/0xb40
             lock_acquire+0xd4/0x2d0
             __mutex_lock+0x9b/0xbf0
             ice_reset_vf+0x22f/0x4d0 [ice]
             ice_process_vflr_event+0x98/0xd0 [ice]
             ice_service_task+0x1cc/0x480 [ice]
             process_one_work+0x1e9/0x4d0
             worker_thread+0x1e1/0x3d0
             kthread+0x104/0x140
             ret_from_fork+0x31/0x50
             ret_from_fork_asm+0x1b/0x30
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(&pf->lag_mutex);
                                     lock(&vf->cfg_lock);
                                     lock(&pf->lag_mutex);
        lock(&vf->cfg_lock);
      
       *** DEADLOCK ***
      4 locks held by kworker/60:3/6771:
       #0: ff40d43e05428b38 ((wq_completion)ice){+.+.}-{0:0}, at: process_one_work+0x176/0x4d0
       #1: ff50d06e05197e58 ((work_completion)(&pf->serv_task)){+.+.}-{0:0}, at: process_one_work+0x176/0x4d0
       #2: ff40d43ea1960e50 (&pf->vfs.table_lock){+.+.}-{3:3}, at: ice_process_vflr_event+0x48/0xd0 [ice]
       #3: ff40d43ea1961210 (&pf->lag_mutex){+.+.}-{3:3}, at: ice_reset_vf+0xb7/0x4d0 [ice]
      
      stack backtrace:
      CPU: 60 PID: 6771 Comm: kworker/60:3 Tainted: G        W  O       6.8.0-rc6 #54
      Hardware name:
      Workqueue: ice ice_service_task [ice]
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4a/0x80
       check_noncircular+0x12d/0x150
       check_prev_add+0xe2/0xc50
       ? save_trace+0x59/0x230
       ? add_chain_cache+0x109/0x450
       validate_chain+0x558/0x800
       __lock_acquire+0x4f8/0xb40
       ? lockdep_hardirqs_on+0x7d/0x100
       lock_acquire+0xd4/0x2d0
       ? ice_reset_vf+0x22f/0x4d0 [ice]
       ? lock_is_held_type+0xc7/0x120
       __mutex_lock+0x9b/0xbf0
       ? ice_reset_vf+0x22f/0x4d0 [ice]
       ? ice_reset_vf+0x22f/0x4d0 [ice]
       ? rcu_is_watching+0x11/0x50
       ? ice_reset_vf+0x22f/0x4d0 [ice]
       ice_reset_vf+0x22f/0x4d0 [ice]
       ? process_one_work+0x176/0x4d0
       ice_process_vflr_event+0x98/0xd0 [ice]
       ice_service_task+0x1cc/0x480 [ice]
       process_one_work+0x1e9/0x4d0
       worker_thread+0x1e1/0x3d0
       ? __pfx_worker_thread+0x10/0x10
       kthread+0x104/0x140
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x31/0x50
       ? __pfx_kthread+0x10/0x10
       ret_from_fork_asm+0x1b/0x30
       </TASK>
      
      To avoid deadlock, we must acquire the LAG mutex only after acquiring the
      VF configuration lock. Fix ice_reset_vf() to acquire the LAG mutex only
      after we either acquire or check that the VF configuration lock is held.
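
      The ordering rule can be illustrated with a small Python threading
      sketch. It is only an analogy under stated assumptions: the two Lock
      objects stand in for vf->cfg_lock and pf->lag_mutex, and the function
      names merely echo the kernel functions discussed above.

```python
import threading

cfg_lock = threading.Lock()   # stands in for vf->cfg_lock
lag_mutex = threading.Lock()  # stands in for pf->lag_mutex


def reset_vf(log):
    # Fixed ordering: configuration lock first, LAG mutex second.
    with cfg_lock:
        with lag_mutex:
            log.append("reset")


def cfg_qs_msg(log):
    # Same ordering as reset_vf(), so no circular wait is possible.
    with cfg_lock:
        with lag_mutex:
            log.append("cfg_qs")


def run_both(iterations=200):
    """Run both paths concurrently and return the combined log."""
    log = []

    def loop(fn):
        for _ in range(iterations):
            fn(log)

    threads = [threading.Thread(target=loop, args=(fn,))
               for fn in (reset_vf, cfg_qs_msg)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

      If reset_vf() took lag_mutex before cfg_lock, the two threads could each
      hold one lock while waiting for the other, which is the scenario lockdep
      reports above.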
      
      Fixes: 9f74a3df ("ice: Fix VF Reset paths when interface in a failed over aggregate")
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Dave Ertman <david.m.ertman@intel.com>
      Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
      Tested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20240423182723.740401-5-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • iavf: Fix TC config comparison with existing adapter TC config · 54976cf5
      Sudheer Mogilappagari authored
      Having the same number of TCs does not imply that the underlying TC
      configs are the same. The configs can differ in the number of queues
      in each TC. Add a utility function to determine whether the TC
      configs are the same.
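
      A Python sketch of such a comparison, under the assumption that each TC
      is described only by its queue count. The function name and the list
      representation are hypothetical, not the driver's data layout.

```python
def tc_configs_equal(current_queues, requested_queues):
    """Compare two TC configs given as lists of per-TC queue counts."""
    if len(current_queues) != len(requested_queues):
        return False
    # Same number of TCs: the per-TC queue counts must match as well.
    return all(cur == req
               for cur, req in zip(current_queues, requested_queues))


same_tc_count_only = tc_configs_equal([4, 4], [2, 6])
identical_config = tc_configs_equal([4, 4], [4, 4])
```

      Comparing only len() would treat [4, 4] and [2, 6] as the same config.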
      
      Fixes: d5b33d02 ("i40evf: add ndo_setup_tc callback to i40evf")
      Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
      Tested-by: Mineri Bhange <minerix.bhange@intel.com> (A Contingent Worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20240423182723.740401-4-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • i40e: Report MFS in decimal base instead of hex · ef3c3131
      Erwan Velu authored
      If the MFS is set below the default (0x2600), a warning message like the
      following is reported:

      	MFS for port 1 has been set below the default: 600

      This message is confusing, as the number shown here (600) is in fact a
      hexadecimal number: 0x600 = 1536.

      Without an explicit "0x" prefix, the message reads as if the MFS were
      set to 600 bytes.

      MFS values, like MTUs, are usually expressed in decimal.

      This commit reports both the current and the default MFS values in
      decimal, so the message is less confusing for end users.

      A typical warning message now looks like the following:
      
      	MFS for port 1 (1536) has been set below the default (9728)
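
      The difference between the two messages can be reproduced in Python. The
      function names are hypothetical; only the message texts and the values
      0x600 and 0x2600 come from the description above.

```python
DEFAULT_MFS = 0x2600  # 9728 in decimal


def old_warning(port: int, mfs: int) -> str:
    # Hex without a "0x" prefix: 0x600 is printed as "600".
    return f"MFS for port {port} has been set below the default: {mfs:x}"


def new_warning(port: int, mfs: int) -> str:
    # Both values in decimal.
    return (f"MFS for port {port} ({mfs}) has been set below "
            f"the default ({DEFAULT_MFS})")


old_text = old_warning(1, 0x600)
new_text = new_warning(1, 0x600)
```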

      Signed-off-by: Erwan Velu <e.velu@criteo.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Tony Brelinski <tony.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Fixes: 3a2c6ced ("i40e: Add a check to see if MFS is set")
      Link: https://lore.kernel.org/r/20240423182723.740401-3-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • i40e: Do not use WQ_MEM_RECLAIM flag for workqueue · 2cc7d150
      Sindhu Devale authored
      A customer reported the following issue during SR-IOV testing (call
      trace below). When both the i40e and the i40iw drivers are loaded, a
      warning in check_flush_dependency is triggered. This seems to be
      because the i40e driver workqueue is allocated with the
      WQ_MEM_RECLAIM flag, and the i40iw one is not.

      A similar error was encountered on ice too, and it was fixed by
      removing the flag. Do the same for i40e.
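
      The condition behind the warning can be modelled in a few lines of
      Python. This is a simplified sketch, not the kernel's
      check_flush_dependency(): the function name is hypothetical and the flag
      value is illustrative.

```python
WQ_MEM_RECLAIM = 1 << 3  # illustrative flag bit


def flush_would_warn(flusher_flags: int, target_flags: int) -> bool:
    """True when a WQ_MEM_RECLAIM workqueue flushes a !WQ_MEM_RECLAIM one."""
    flusher_reclaims = bool(flusher_flags & WQ_MEM_RECLAIM)
    target_reclaims = bool(target_flags & WQ_MEM_RECLAIM)
    return flusher_reclaims and not target_reclaims


# i40e workqueue with the flag, flushing a workqueue without it.
before_fix = flush_would_warn(WQ_MEM_RECLAIM, 0)
# After dropping the flag from the i40e workqueue.
after_fix = flush_would_warn(0, 0)
```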
      
      [Feb 9 09:08] ------------[ cut here ]------------
      [  +0.000004] workqueue: WQ_MEM_RECLAIM i40e:i40e_service_task [i40e] is
      flushing !WQ_MEM_RECLAIM infiniband:0x0
      [  +0.000060] WARNING: CPU: 0 PID: 937 at kernel/workqueue.c:2966
      check_flush_dependency+0x10b/0x120
      [  +0.000007] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq
      snd_timer snd_seq_device snd soundcore nls_utf8 cifs cifs_arc4
      nls_ucs2_utils rdma_cm iw_cm ib_cm cifs_md4 dns_resolver netfs qrtr
      rfkill sunrpc vfat fat intel_rapl_msr intel_rapl_common irdma
      intel_uncore_frequency intel_uncore_frequency_common ice ipmi_ssif
      isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal
      intel_powerclamp gnss coretemp ib_uverbs rapl intel_cstate ib_core
      iTCO_wdt iTCO_vendor_support acpi_ipmi mei_me ipmi_si intel_uncore
      ioatdma i2c_i801 joydev pcspkr mei ipmi_devintf lpc_ich
      intel_pch_thermal i2c_smbus ipmi_msghandler acpi_power_meter acpi_pad
      xfs libcrc32c ast sd_mod drm_shmem_helper t10_pi drm_kms_helper sg ixgbe
      drm i40e ahci crct10dif_pclmul libahci crc32_pclmul igb crc32c_intel
      libata ghash_clmulni_intel i2c_algo_bit mdio dca wmi dm_mirror
      dm_region_hash dm_log dm_mod fuse
      [  +0.000050] CPU: 0 PID: 937 Comm: kworker/0:3 Kdump: loaded Not
      tainted 6.8.0-rc2-Feb-net_dev-Qiueue-00279-gbd43c5687e05 #1
      [  +0.000003] Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS
      SE5C620.86B.02.01.0013.121520200651 12/15/2020
      [  +0.000001] Workqueue: i40e i40e_service_task [i40e]
      [  +0.000024] RIP: 0010:check_flush_dependency+0x10b/0x120
      [  +0.000003] Code: ff 49 8b 54 24 18 48 8d 8b b0 00 00 00 49 89 e8 48
      81 c6 b0 00 00 00 48 c7 c7 b0 97 fa 9f c6 05 8a cc 1f 02 01 e8 35 b3 fd
      ff <0f> 0b e9 10 ff ff ff 80 3d 78 cc 1f 02 00 75 94 e9 46 ff ff ff 90
      [  +0.000002] RSP: 0018:ffffbd294976bcf8 EFLAGS: 00010282
      [  +0.000002] RAX: 0000000000000000 RBX: ffff94d4c483c000 RCX:
      0000000000000027
      [  +0.000001] RDX: ffff94d47f620bc8 RSI: 0000000000000001 RDI:
      ffff94d47f620bc0
      [  +0.000001] RBP: 0000000000000000 R08: 0000000000000000 R09:
      00000000ffff7fff
      [  +0.000001] R10: ffffbd294976bb98 R11: ffffffffa0be65e8 R12:
      ffff94c5451ea180
      [  +0.000001] R13: ffff94c5ab5e8000 R14: ffff94c5c20b6e05 R15:
      ffff94c5f1330ab0
      [  +0.000001] FS:  0000000000000000(0000) GS:ffff94d47f600000(0000)
      knlGS:0000000000000000
      [  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  +0.000001] CR2: 00007f9e6f1fca70 CR3: 0000000038e20004 CR4:
      00000000007706f0
      [  +0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [  +0.000001] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
      0000000000000400
      [  +0.000001] PKRU: 55555554
      [  +0.000001] Call Trace:
      [  +0.000001]  <TASK>
      [  +0.000002]  ? __warn+0x80/0x130
      [  +0.000003]  ? check_flush_dependency+0x10b/0x120
      [  +0.000002]  ? report_bug+0x195/0x1a0
      [  +0.000005]  ? handle_bug+0x3c/0x70
      [  +0.000003]  ? exc_invalid_op+0x14/0x70
      [  +0.000002]  ? asm_exc_invalid_op+0x16/0x20
      [  +0.000006]  ? check_flush_dependency+0x10b/0x120
      [  +0.000002]  ? check_flush_dependency+0x10b/0x120
      [  +0.000002]  __flush_workqueue+0x126/0x3f0
      [  +0.000015]  ib_cache_cleanup_one+0x1c/0xe0 [ib_core]
      [  +0.000056]  __ib_unregister_device+0x6a/0xb0 [ib_core]
      [  +0.000023]  ib_unregister_device_and_put+0x34/0x50 [ib_core]
      [  +0.000020]  i40iw_close+0x4b/0x90 [irdma]
      [  +0.000022]  i40e_notify_client_of_netdev_close+0x54/0xc0 [i40e]
      [  +0.000035]  i40e_service_task+0x126/0x190 [i40e]
      [  +0.000024]  process_one_work+0x174/0x340
      [  +0.000003]  worker_thread+0x27e/0x390
      [  +0.000001]  ? __pfx_worker_thread+0x10/0x10
      [  +0.000002]  kthread+0xdf/0x110
      [  +0.000002]  ? __pfx_kthread+0x10/0x10
      [  +0.000002]  ret_from_fork+0x2d/0x50
      [  +0.000003]  ? __pfx_kthread+0x10/0x10
      [  +0.000001]  ret_from_fork_asm+0x1b/0x30
      [  +0.000004]  </TASK>
      [  +0.000001] ---[ end trace 0000000000000000 ]---
      
      Fixes: 4d5957cb ("i40e: remove WQ_UNBOUND and the task limit of our workqueue")
      Signed-off-by: Sindhu Devale <sindhu.devale@intel.com>
      Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
      Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
      Tested-by: Robert Ganzynkowicz <robert.ganzynkowicz@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20240423182723.740401-2-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns() · 4dcd0e83
      Dan Carpenter authored
      The rx_chn->irq[] array is unsigned int but it should be signed for the
      error handling to work.  Also if k3_udma_glue_rx_get_irq() returns zero
      then we should return -ENXIO instead of success.
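
      The signedness problem can be reproduced in Python with ctypes, which
      wraps values the way C integer conversion does. The function names are
      hypothetical; this only models the comparison, not the driver.

```python
import ctypes
import errno


def unsigned_check_sees_error(ret: int) -> bool:
    # Storing a negative errno in an unsigned int wraps to a large value,
    # so a "< 0" test can never be true.
    irq = ctypes.c_uint(ret).value
    return irq < 0


def signed_check(ret: int) -> int:
    # With a signed int the negative errno survives; zero is also an error.
    irq = ctypes.c_int(ret).value
    if irq < 0:
        return irq
    if irq == 0:
        return -errno.ENXIO
    return 0  # success


missed = unsigned_check_sees_error(-errno.EINVAL)
caught = signed_check(-errno.EINVAL)
zero_irq = signed_check(0)
```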
      
      Fixes: 128d5874 ("net: ti: icssg-prueth: Add ICSSG ethernet driver")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Roger Quadros <rogerq@kernel.org>
      Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
      Link: https://lore.kernel.org/r/05282415-e7f4-42f3-99f8-32fde8f30936@moroto.mountain
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Advertise mlx5 ethernet driver updates sk_buff md_dst for MACsec · 39d26a8f
      Rahul Rameshbabu authored
      mlx5 Rx flow steering and CQE handling enable the driver to update an
      skb's md_dst attribute as MACsec when MACsec traffic arrives on a device
      configured for offloading. Advertise this to the core stack so that it
      can take advantage of this capability.
      
      Cc: stable@vger.kernel.org
      Fixes: b7c9400c ("net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
      Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/20240423181319.115860-5-rrameshbabu@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • macsec: Detect if Rx skb is macsec-related for offloading devices that update md_dst · 642c984d
      Rahul Rameshbabu authored
      Can now correctly identify where the packets should be delivered by using
      md_dst or its absence on devices that provide it.
      
      This detection is not possible without device drivers that update md_dst. A
      fallback pattern should be used to support drivers that do not update it. This
      fallback mode causes multicast messages to be cloned to both the non-macsec
      and macsec ports, independent of whether the multicast message received was
      encrypted over MACsec or not. Other non-macsec traffic may also fail to be
      handled correctly for devices in promiscuous mode.
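
      The two modes can be contrasted in a short Python sketch. It is a
      deliberately reduced model, not the macsec Rx handler: the function and
      parameter names are hypothetical, and the fallback is represented by a
      precomputed heuristic result.

```python
def rx_goes_to_macsec(dev_updates_md_dst: bool,
                      skb_has_macsec_md_dst: bool,
                      heuristic_match: bool) -> bool:
    """Decide whether an Rx skb is delivered to the macsec port."""
    if dev_updates_md_dst:
        # Presence or absence of md_dst is authoritative on such devices.
        return skb_has_macsec_md_dst
    # Fallback: the device gives no indication, so rely on the heuristic.
    return heuristic_match


# Plaintext multicast frame on a device that updates md_dst: not delivered.
capable_device = rx_goes_to_macsec(True, False, heuristic_match=True)
# Same frame on a device that does not update md_dst: heuristic decides.
fallback_device = rx_goes_to_macsec(False, False, heuristic_match=True)
```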
      
      Link: https://lore.kernel.org/netdev/ZULRxX9eIbFiVi7v@hog/
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: stable@vger.kernel.org
      Fixes: 860ead89 ("net/macsec: Add MACsec skb_metadata_dst Rx Data path support")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
      Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/20240423181319.115860-4-rrameshbabu@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Rahul Rameshbabu's avatar
      ethernet: Add helper for assigning packet type when dest address does not match device address · 6e159fd6
      Rahul Rameshbabu authored
      Enable reuse of logic in eth_type_trans for determining packet type.
      
      Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/20240423181319.115860-3-rrameshbabu@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6e159fd6
    • Rahul Rameshbabu's avatar
      macsec: Enable devices to advertise whether they update sk_buff md_dst during offloads · 475747a1
      Rahul Rameshbabu authored
      Cannot know whether a Rx skb missing md_dst is intended for MACsec or not
      without knowing whether the device is able to update this field during an
      offload. Assume that an offload to a MACsec device cannot support updating
      md_dst by default. Capable devices can advertise that they do indicate that
      an skb is related to a MACsec offloaded packet using the md_dst.
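
The decision described above can be sketched roughly as follows. This is a simplified illustration, not the kernel's code: `rx_verdict`, `classify_rx` and both parameter names are hypothetical, and the real receive path looks at more state than two booleans.

```c
#include <stdbool.h>

/* Hypothetical outcome of classifying a frame received on an offloading device. */
enum rx_verdict {
    RX_MACSEC,      /* deliver to the MACsec port */
    RX_NOT_MACSEC,  /* deliver to the non-MACsec port */
    RX_UNKNOWN      /* cannot tell from md_dst alone: use the fallback path */
};

/*
 * dev_updates_md_dst: the device advertised that it sets md_dst on offloaded
 *                     MACsec packets (assumed to be false unless advertised).
 * has_macsec_md_dst:  the received skb carries MACsec metadata.
 */
static enum rx_verdict classify_rx(bool dev_updates_md_dst, bool has_macsec_md_dst)
{
    if (has_macsec_md_dst)
        return RX_MACSEC;
    if (dev_updates_md_dst)
        return RX_NOT_MACSEC;   /* absence of md_dst is meaningful here */
    return RX_UNKNOWN;          /* absence of md_dst proves nothing */
}
```

The point of the sketch is the last branch: only a device that advertises the capability lets the stack treat a missing md_dst as "not MACsec".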
      
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: stable@vger.kernel.org
      Fixes: 860ead89 ("net/macsec: Add MACsec skb_metadata_dst Rx Data path support")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
      Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
      Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/20240423181319.115860-2-rrameshbabu@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      475747a1
    • David S. Miller's avatar
      Merge tag 'wireless-2024-04-23' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless · 46bf0c9a
      David S. Miller authored
      Johannes Berg says:
      
      ====================
      Fixes for the current cycle:
       * ath11k: convert to correct RCU iteration of IPv6 addresses
       * iwlwifi: link ID, FW API version, scanning and PASN fixes
       * cfg80211: NULL-deref and tracing fixes
       * mac80211: connection mode, mesh fast-TX, multi-link and
                   various other small fixes
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46bf0c9a
    • MD Danish Anwar's avatar
      net: phy: dp83869: Fix MII mode failure · 6c9cd59d
      MD Danish Anwar authored
      The DP83869 driver sets the MII bit (needed for PHY to work in MII mode)
      only if the op-mode is either DP83869_100M_MEDIA_CONVERT or
      DP83869_RGMII_100_BASE.
      
      Some drivers, e.g. ICSSG, support MII mode with op-mode as
      DP83869_RGMII_COPPER_ETHERNET, for which the MII bit is not set in the
      dp83869 driver. As a result MII mode on ICSSG doesn't work and the log
      below is seen.
      
      TI DP83869 300b2400.mdio:0f: selected op-mode is not valid with MII mode
      icssg-prueth icssg1-eth: couldn't connect to phy ethernet-phy@0
      icssg-prueth icssg1-eth: can't phy connect port MII0
      
      Fix this by setting MII bit for DP83869_RGMII_COPPER_ETHERNET op-mode as
      well.
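
A minimal sketch of the check after the fix, assuming hypothetical stand-ins for the driver's op-mode constants (the enum below and the helper `op_mode_allows_mii` are illustrative and do not reproduce the driver's definitions or register values):

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's op-mode constants. */
enum op_mode {
    OP_RGMII_COPPER_ETHERNET,
    OP_RGMII_100_BASE,
    OP_100M_MEDIA_CONVERT,
    OP_OTHER
};

/* Returns true when the MII bit should be set for the given op-mode. */
static bool op_mode_allows_mii(enum op_mode mode)
{
    switch (mode) {
    case OP_100M_MEDIA_CONVERT:
    case OP_RGMII_100_BASE:
    case OP_RGMII_COPPER_ETHERNET:  /* the case this fix adds */
        return true;
    default:
        return false;
    }
}
```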
      
      Fixes: 94e86ef1 ("net: phy: dp83869: support mii mode when rgmii strap cfg is used")
      Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
      Reviewed-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c9cd59d
    • Jakub Kicinski's avatar
      Merge tag 'for-net-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · e6b21901
      Jakub Kicinski authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
       - qca: set power_ctrl_enabled on NULL returned by gpiod_get_optional()
       - hci_sync: Using hci_cmd_sync_submit when removing Adv Monitor
       - qca: fix invalid device address check
       - hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync
       - Fix type of len in {l2cap,sco}_sock_getsockopt_old()
       - btusb: mediatek: Fix double free of skb in coredump
       - btusb: Add Realtek RTL8852BE support ID 0x0bda:0x4853
       - btusb: Fix triggering coredump implementation for QCA
      
      * tag 'for-net-2024-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
        Bluetooth: qca: set power_ctrl_enabled on NULL returned by gpiod_get_optional()
        Bluetooth: hci_sync: Using hci_cmd_sync_submit when removing Adv Monitor
        Bluetooth: qca: fix NULL-deref on non-serdev setup
        Bluetooth: qca: fix NULL-deref on non-serdev suspend
        Bluetooth: btusb: mediatek: Fix double free of skb in coredump
        Bluetooth: MGMT: Fix failing to MGMT_OP_ADD_UUID/MGMT_OP_REMOVE_UUID
        Bluetooth: qca: fix invalid device address check
        Bluetooth: hci_event: Fix sending HCI_OP_READ_ENC_KEY_SIZE
        Bluetooth: btusb: Fix triggering coredump implementation for QCA
        Bluetooth: btusb: Add Realtek RTL8852BE support ID 0x0bda:0x4853
        Bluetooth: hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync
        Bluetooth: Fix type of len in {l2cap,sco}_sock_getsockopt_old()
      ====================
      
      Link: https://lore.kernel.org/r/20240424204102.2319483-1-luiz.dentz@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e6b21901
    • Jakub Kicinski's avatar
      eth: bnxt: fix counting packets discarded due to OOM and netpoll · 73011773
      Jakub Kicinski authored
      I added OOM and netpoll discard counters, naively assuming that
      the cpr pointer is pointing to a common completion ring.
      Turns out that is usually *a* completion ring but not *the*
      completion ring which bnapi->cp_ring points to. bnapi->cp_ring
      is where the stats are read from, so we end up reporting 0
      thru ethtool -S and qstat even though the drop events have happened.
      Make 100% sure we're recording statistics in the correct structure.
      
      Fixes: 907fd4a2 ("bnxt: count discards due to memory allocation errors")
      Reviewed-by: Michael Chan <michael.chan@broadcom.com>
      Link: https://lore.kernel.org/r/20240424002148.3937059-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      73011773
    • Lukas Wunner's avatar
      igc: Fix LED-related deadlock on driver unbind · c04d1b9e
      Lukas Wunner authored
      Roman reports a deadlock on unplug of a Thunderbolt docking station
      containing an Intel I225 Ethernet adapter.
      
      The root cause is that led_classdev's for LEDs on the adapter are
      registered such that they're device-managed by the netdev.  That
      results in recursive acquisition of the rtnl_lock() mutex on unplug:
      
      When the driver calls unregister_netdev(), it acquires rtnl_lock(),
      then frees the device-managed resources.  Upon unregistering the LEDs,
      netdev_trig_deactivate() invokes unregister_netdevice_notifier(),
      which tries to acquire rtnl_lock() again.
      
      Avoid by using non-device-managed LED registration.
      
      Stack trace for posterity:
      
        schedule+0x6e/0xf0
        schedule_preempt_disabled+0x15/0x20
        __mutex_lock+0x2a0/0x750
        unregister_netdevice_notifier+0x40/0x150
        netdev_trig_deactivate+0x1f/0x60 [ledtrig_netdev]
        led_trigger_set+0x102/0x330
        led_classdev_unregister+0x4b/0x110
        release_nodes+0x3d/0xb0
        devres_release_all+0x8b/0xc0
        device_del+0x34f/0x3c0
        unregister_netdevice_many_notify+0x80b/0xaf0
        unregister_netdev+0x7c/0xd0
        igc_remove+0xd8/0x1e0 [igc]
        pci_device_remove+0x3f/0xb0
      
      Fixes: ea578703 ("igc: Add support for LEDs on i225/i226")
      Reported-by: Roman Lozko <lozko.roma@gmail.com>
      Closes: https://lore.kernel.org/r/CAEhC_B=ksywxCG_+aQqXUrGEgKq+4mqnSV8EBHOKbC3-Obj9+Q@mail.gmail.com/
      Reported-by: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
      Closes: https://lore.kernel.org/r/ZhRD3cOtz5i-61PB@mail-itl/
      Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
      Tested-by: Kurt Kanzenbach <kurt@linutronix.de> # Intel i225
      Tested-by: Naama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20240422204503.225448-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c04d1b9e
    • Duanqiang Wen's avatar
      Revert "net: txgbe: fix clk_name exceed MAX_DEV_ID limits" · edd2d250
      Duanqiang Wen authored
      This reverts commit e30cef00.
      
      Commit 99f4570c ("clkdev: Update clkdev id usage to allow
      for longer names") fixes the problem of clk_name exceeding the MAX_DEV_ID
      limits, so the reverted commit is no longer needed.
      
      Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240422084109.3201-2-duanqiangwen@net-swift.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      edd2d250
    • Duanqiang Wen's avatar
      Revert "net: txgbe: fix i2c dev name cannot match clkdev" · 8d6bf83f
      Duanqiang Wen authored
      This reverts commit c644920c.
      
      When registering the i2c device, txgbe shortens "i2c_designware" to
      "i2c_dw", which prevents the i2c device from matching the platform driver
      i2c_designware_platform.
      
      Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240422084109.3201-1-duanqiangwen@net-swift.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8d6bf83f
    • Jakub Kicinski's avatar
      Merge branch 'mlxsw-various-acl-fixes' · 04816dc9
      Jakub Kicinski authored
      Petr Machata says:
      
      ====================
      mlxsw: Various ACL fixes
      
      Ido Schimmel writes:
      
      Fix various problems in the ACL (i.e., flower offload) code. See the
      commit messages for more details.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      04816dc9
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Fix memory leak when canceling rehash work · fb4e2b70
      Ido Schimmel authored
      The rehash delayed work is rescheduled with a delay if the number of
      credits at end of the work is not negative as supposedly it means that
      the migration ended. Otherwise, it is rescheduled immediately.
      
      After "mlxsw: spectrum_acl_tcam: Fix possible use-after-free during
      rehash" the above is no longer accurate as a non-negative number of
      credits is no longer indicative of the migration being done. It can also
      happen if the work encountered an error in which case the migration will
      resume the next time the work is scheduled.
      
      The significance of the above is that it is possible for the work to be
      pending and associated with hints that were allocated when the migration
      started. This leads to the hints being leaked [1] when the work is
      canceled while pending as part of ACL region dismantle.
      
      Fix by freeing the hints if hints are associated with a work that was
      canceled while pending.
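
A userspace sketch of the idea, under stated assumptions: `struct rehash_work`, `cancel_work` and `rehash_work_fini` are hypothetical models, with `cancel_work` standing in for a cancel primitive that reports whether the work was still pending.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a delayed work item that may own migration hints. */
struct rehash_work {
    bool pending;   /* work is queued but has not run yet */
    void *hints;    /* allocated when a migration started, NULL otherwise */
};

/* Models cancelling the work: returns true if it was still pending. */
static bool cancel_work(struct rehash_work *w)
{
    bool was_pending = w->pending;

    w->pending = false;
    return was_pending;
}

/* Region dismantle: free hints owned by a work that never got to run. */
static void rehash_work_fini(struct rehash_work *w)
{
    if (cancel_work(w) && w->hints) {
        free(w->hints);
        w->hints = NULL;
    }
}
```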
      
      Blame the original commit since the reliance on not having a pending
      work associated with hints is fragile.
      
      [1]
      unreferenced object 0xffff88810e7c3000 (size 256):
        comm "kworker/0:16", pid 176, jiffies 4295460353
        hex dump (first 32 bytes):
          00 30 95 11 81 88 ff ff 61 00 00 00 00 00 00 80  .0......a.......
          00 00 61 00 40 00 00 00 00 00 00 00 04 00 00 00  ..a.@...........
        backtrace (crc 2544ddb9):
          [<00000000cf8cfab3>] kmalloc_trace+0x23f/0x2a0
          [<000000004d9a1ad9>] objagg_hints_get+0x42/0x390
          [<000000000b143cf3>] mlxsw_sp_acl_erp_rehash_hints_get+0xca/0x400
          [<0000000059bdb60a>] mlxsw_sp_acl_tcam_vregion_rehash_work+0x868/0x1160
          [<00000000e81fd734>] process_one_work+0x59c/0xf20
          [<00000000ceee9e81>] worker_thread+0x799/0x12c0
          [<00000000bda6fe39>] kthread+0x246/0x300
          [<0000000070056d23>] ret_from_fork+0x34/0x70
          [<00000000dea2b93e>] ret_from_fork_asm+0x1a/0x30
      
      Fixes: c9c9af91 ("mlxsw: spectrum_acl: Allow to interrupt/continue rehash work")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/0cc12ebb07c4d4c41a1265ee2c28b392ff997a86.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fb4e2b70
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Fix incorrect list API usage · b377add0
      Ido Schimmel authored
      Both the function that migrates all the chunks within a region and the
      function that migrates all the entries within a chunk call
      list_first_entry() on the respective lists without checking that the
      lists are not empty. This is incorrect usage of the API, which leads to
      the following warning [1].
      
      Fix by returning if the lists are empty as there is nothing to migrate
      in this case.
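
To illustrate why the emptiness check matters, here is a self-contained userspace re-creation of a circular doubly linked list. The helper names (`list_init`, `list_is_empty`, `list_add_tail_node`, `first_entry_id`) and `struct entry` are made up for this sketch and are not the kernel's list API.

```c
#include <stddef.h>

/* Minimal userspace re-creation of a circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static int list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail_node(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Hypothetical entry type, used only for illustration. */
struct entry {
    int id;
    struct list_head node;
};

/*
 * Returns the id of the first entry, or -1 when there is nothing to migrate.
 * Without the emptiness check, container_of() would be applied to the list
 * head itself and yield a pointer to something that is not an entry.
 */
static int first_entry_id(struct list_head *head)
{
    if (list_is_empty(head))
        return -1;
    return container_of(head->next, struct entry, node)->id;
}
```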
      
      [1]
      WARNING: CPU: 0 PID: 6437 at drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c:1266 mlxsw_sp_acl_tcam_vchunk_migrate_all+0x1f1/0>
      Modules linked in:
      CPU: 0 PID: 6437 Comm: kworker/0:37 Not tainted 6.9.0-rc3-custom-00883-g94a65f079ef6 #39
      Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
      Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
      RIP: 0010:mlxsw_sp_acl_tcam_vchunk_migrate_all+0x1f1/0x2c0
      [...]
      Call Trace:
       <TASK>
       mlxsw_sp_acl_tcam_vregion_rehash_work+0x6c/0x4a0
       process_one_work+0x151/0x370
       worker_thread+0x2cb/0x3e0
       kthread+0xd0/0x100
       ret_from_fork+0x34/0x50
       ret_from_fork_asm+0x1a/0x30
       </TASK>
      
      Fixes: 6f9579d4 ("mlxsw: spectrum_acl: Remember where to continue rehash migration")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/4628e9a22d1d84818e28310abbbc498e7bc31bc9.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b377add0
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Fix warning during rehash · 743edc85
      Ido Schimmel authored
      As previously explained, the rehash delayed work migrates filters from
      one region to another. This is done by iterating over all chunks (all
      the filters with the same priority) in the region and in each chunk
      iterating over all the filters.
      
      When the work runs out of credits it stores the current chunk and entry
      as markers in the per-work context so that it would know where to resume
      the migration from the next time the work is scheduled.
      
      Upon error, the chunk marker is reset to NULL, but the entry markers are
      not reset despite being relative to it. This can result in migration
      being resumed from an entry that does not belong to the chunk being
      migrated. In turn, this will eventually lead to a chunk being iterated
      over as if it is an entry. Because of how the two structures happen to
      be defined, this does not lead to KASAN splats, but to warnings such as
      [1].
      
      Fix by creating a helper that resets all the markers and call it from
      all the places that currently only reset the chunk marker. For good
      measure also call it when starting a completely new rehash. Add a
      warning to avoid future cases.
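
A sketch of such a helper, assuming a hypothetical context layout (the struct and field names below are illustrative and may differ from the driver's):

```c
#include <stddef.h>

/* Hypothetical per-work context; field names are illustrative. */
struct rehash_ctx {
    void *current_chunk;  /* chunk being migrated */
    void *start_entry;    /* entry to resume from, relative to current_chunk */
    void *stop_entry;     /* entry to stop at, relative to current_chunk */
};

/* Reset the chunk marker together with the entry markers that depend on it. */
static void rehash_ctx_reset_markers(struct rehash_ctx *ctx)
{
    ctx->current_chunk = NULL;
    ctx->start_entry = NULL;
    ctx->stop_entry = NULL;
}
```

Keeping the three assignments in one place means no call site can clear the chunk marker while leaving stale entry markers behind.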
      
      [1]
      WARNING: CPU: 7 PID: 1076 at drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c:407 mlxsw_afk_encode+0x242/0x2f0
      Modules linked in:
      CPU: 7 PID: 1076 Comm: kworker/7:24 Tainted: G        W          6.9.0-rc3-custom-00880-g29e61d91b77b #29
      Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
      Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
      RIP: 0010:mlxsw_afk_encode+0x242/0x2f0
      [...]
      Call Trace:
       <TASK>
       mlxsw_sp_acl_atcam_entry_add+0xd9/0x3c0
       mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0
       mlxsw_sp_acl_tcam_vchunk_migrate_all+0x109/0x290
       mlxsw_sp_acl_tcam_vregion_rehash_work+0x6c/0x470
       process_one_work+0x151/0x370
       worker_thread+0x2cb/0x3e0
       kthread+0xd0/0x100
       ret_from_fork+0x34/0x50
       </TASK>
      
      Fixes: 6f9579d4 ("mlxsw: spectrum_acl: Remember where to continue rehash migration")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/cc17eed86b41dd829d39b07906fec074a9ce580e.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      743edc85
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Fix memory leak during rehash · 8ca3f7a7
      Ido Schimmel authored
      The rehash delayed work migrates filters from one region to another.
      This is done by iterating over all chunks (all the filters with the same
      priority) in the region and in each chunk iterating over all the
      filters.
      
      If the migration fails, the code tries to migrate the filters back to
      the old region. However, the rollback itself can also fail in which case
      another migration will be erroneously performed. Besides the fact that
      this ping pong is not a very good idea, it also creates a problem.
      
      Each virtual chunk references two chunks: The currently used one
      ('vchunk->chunk') and a backup ('vchunk->chunk2'). During migration the
      first holds the chunk we want to migrate filters to and the second holds
      the chunk we are migrating filters from.
      
      The code currently assumes - but does not verify - that the backup chunk
      does not exist (NULL) if the currently used chunk does not reference the
      target region. This assumption breaks when we are trying to rollback a
      rollback, resulting in the backup chunk being overwritten and leaked
      [1].
      
      Fix by not rolling back a failed rollback and add a warning to avoid
      future cases.
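
The control flow can be modelled roughly like this. Everything here is an assumption made for illustration: `struct migrate_ctx`, `migrate_once` and the always-failing stub `failing_pass` are hypothetical, and a real migration pass moves filters rather than bumping a counter.

```c
#include <stdbool.h>

/* Hypothetical migration context; names are illustrative. */
struct migrate_ctx {
    bool this_is_rollback;  /* set once a rollback has been started */
    int attempts;           /* number of migration passes made so far */
};

typedef int (*migrate_fn)(struct migrate_ctx *ctx);

/* Stub that always fails, standing in for a pass on exhausted hardware. */
static int failing_pass(struct migrate_ctx *ctx)
{
    ctx->attempts++;
    return -1;
}

/* Run one migration pass; on failure, roll back at most once. */
static int migrate_once(struct migrate_ctx *ctx, migrate_fn pass)
{
    int err = pass(ctx);

    if (!err)
        return 0;
    if (ctx->this_is_rollback)
        return err;             /* a failed rollback is not rolled back again */

    ctx->this_is_rollback = true;
    pass(ctx);                  /* single attempt to move the filters back */
    return err;
}
```

With `failing_pass`, the first call makes two passes (migration, then rollback) and any later call makes exactly one, so the ping-pong between regions cannot occur.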
      
      [1]
      WARNING: CPU: 5 PID: 1063 at lib/parman.c:291 parman_destroy+0x17/0x20
      Modules linked in:
      CPU: 5 PID: 1063 Comm: kworker/5:11 Tainted: G        W          6.9.0-rc2-custom-00784-gc6a05c468a0b #14
      Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
      Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
      RIP: 0010:parman_destroy+0x17/0x20
      [...]
      Call Trace:
       <TASK>
       mlxsw_sp_acl_atcam_region_fini+0x19/0x60
       mlxsw_sp_acl_tcam_region_destroy+0x49/0xf0
       mlxsw_sp_acl_tcam_vregion_rehash_work+0x1f1/0x470
       process_one_work+0x151/0x370
       worker_thread+0x2cb/0x3e0
       kthread+0xd0/0x100
       ret_from_fork+0x34/0x50
       ret_from_fork_asm+0x1a/0x30
       </TASK>
      
      Fixes: 84350051 ("mlxsw: spectrum_acl: Do rollback as another call to mlxsw_sp_acl_tcam_vchunk_migrate_all()")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/d5edd4f4503934186ae5cfe268503b16345b4e0f.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8ca3f7a7
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Rate limit error message · 5bcf9255
      Ido Schimmel authored
      In the rare cases when the device resources are exhausted it is likely
      that the rehash delayed work will fail. An error message will be printed
      whenever this happens which can be overwhelming considering the fact
      that the work is per-region and that there can be hundreds of regions.
      
      Fix by rate limiting the error message.
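
For readers unfamiliar with the technique, a generic interval-and-burst limiter might look like the sketch below. It is not the kernel's rate limiting facility; `struct ratelimit`, `ratelimit_allow` and the abstract time unit are assumptions chosen to keep the example deterministic.

```c
#include <stdbool.h>

/* Hypothetical limiter: at most `burst` messages per `interval` time units. */
struct ratelimit {
    long interval;
    int burst;
    long window_start;
    int printed;
};

/* Returns true if a message may be printed at time `now`. */
static bool ratelimit_allow(struct ratelimit *rl, long now)
{
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;   /* start a new window */
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return false;             /* budget for this window is spent */
    rl->printed++;
    return true;
}
```

With hundreds of regions failing at once, only the first `burst` messages in each window would be emitted and the rest suppressed.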
      
      Fixes: e5e7962e ("mlxsw: spectrum_acl: Implement region migration according to hints")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/c510763b2ebd25e7990d80183feff91cde593145.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5bcf9255
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Fix possible use-after-free during rehash · 54225988
      Ido Schimmel authored
      The rehash delayed work migrates filters from one region to another
      according to the number of available credits.
      
      The region being migrated from is destroyed at the end of the work if the
      number of credits is non-negative as the assumption is that this is
      indicative of migration being complete. This assumption is incorrect as
      a non-negative number of credits can also be the result of a failed
      migration.
      
      The destruction of a region that still has filters referencing it can
      result in a use-after-free [1].
      
      Fix by not destroying the region if migration failed.
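
Expressed as a predicate, the corrected condition could look like this. The function name and the convention that `err` is zero on success are assumptions made for this sketch.

```c
#include <stdbool.h>

/*
 * credits >= 0 means the work did not run out of budget; err != 0 means the
 * migration failed. Only a migration that both finished and succeeded leaves
 * the old region without filters referencing it.
 */
static bool may_destroy_old_region(int credits, int err)
{
    return credits >= 0 && err == 0;
}
```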
      
      [1]
      BUG: KASAN: slab-use-after-free in mlxsw_sp_acl_ctcam_region_entry_remove+0x21d/0x230
      Read of size 8 at addr ffff8881735319e8 by task kworker/0:31/3858
      
      CPU: 0 PID: 3858 Comm: kworker/0:31 Tainted: G        W          6.9.0-rc2-custom-00782-gf2275c2157d8 #5
      Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
      Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
      Call Trace:
       <TASK>
       dump_stack_lvl+0xc6/0x120
       print_report+0xce/0x670
       kasan_report+0xd7/0x110
       mlxsw_sp_acl_ctcam_region_entry_remove+0x21d/0x230
       mlxsw_sp_acl_ctcam_entry_del+0x2e/0x70
       mlxsw_sp_acl_atcam_entry_del+0x81/0x210
       mlxsw_sp_acl_tcam_vchunk_migrate_all+0x3cd/0xb50
       mlxsw_sp_acl_tcam_vregion_rehash_work+0x157/0x1300
       process_one_work+0x8eb/0x19b0
       worker_thread+0x6c9/0xf70
       kthread+0x2c9/0x3b0
       ret_from_fork+0x4d/0x80
       ret_from_fork_asm+0x1a/0x30
       </TASK>
      
      Allocated by task 174:
       kasan_save_stack+0x33/0x60
       kasan_save_track+0x14/0x30
       __kasan_kmalloc+0x8f/0xa0
       __kmalloc+0x19c/0x360
       mlxsw_sp_acl_tcam_region_create+0xdf/0x9c0
       mlxsw_sp_acl_tcam_vregion_rehash_work+0x954/0x1300
       process_one_work+0x8eb/0x19b0
       worker_thread+0x6c9/0xf70
       kthread+0x2c9/0x3b0
       ret_from_fork+0x4d/0x80
       ret_from_fork_asm+0x1a/0x30
      
      Freed by task 7:
       kasan_save_stack+0x33/0x60
       kasan_save_track+0x14/0x30
       kasan_save_free_info+0x3b/0x60
       poison_slab_object+0x102/0x170
       __kasan_slab_free+0x14/0x30
       kfree+0xc1/0x290
       mlxsw_sp_acl_tcam_region_destroy+0x272/0x310
       mlxsw_sp_acl_tcam_vregion_rehash_work+0x731/0x1300
       process_one_work+0x8eb/0x19b0
       worker_thread+0x6c9/0xf70
       kthread+0x2c9/0x3b0
       ret_from_fork+0x4d/0x80
       ret_from_fork_asm+0x1a/0x30
      
      Fixes: c9c9af91 ("mlxsw: spectrum_acl: Allow to interrupt/continue rehash work")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/3e412b5659ec2310c5c615760dfe5eac18dd7ebd.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      54225988
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Fix possible use-after-free during activity update · 79b5b4b1
      Ido Schimmel authored
      The rule activity update delayed work periodically traverses the list of
      configured rules and queries their activity from the device.
      
      As part of this task it accesses the entry pointed to by 'ventry->entry',
      but this entry can be changed concurrently by the rehash delayed work,
      leading to a use-after-free [1].
      
      Fix by closing the race and performing the activity query under the
      'vregion->lock' mutex.
      
      [1]
      BUG: KASAN: slab-use-after-free in mlxsw_sp_acl_tcam_flower_rule_activity_get+0x121/0x140
      Read of size 8 at addr ffff8881054ed808 by task kworker/0:18/181
      
      CPU: 0 PID: 181 Comm: kworker/0:18 Not tainted 6.9.0-rc2-custom-00781-gd5ab772d32f7 #2
      Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
      Workqueue: mlxsw_core mlxsw_sp_acl_rule_activity_update_work
      Call Trace:
       <TASK>
       dump_stack_lvl+0xc6/0x120
       print_report+0xce/0x670
       kasan_report+0xd7/0x110
       mlxsw_sp_acl_tcam_flower_rule_activity_get+0x121/0x140
       mlxsw_sp_acl_rule_activity_update_work+0x219/0x400
       process_one_work+0x8eb/0x19b0
       worker_thread+0x6c9/0xf70
       kthread+0x2c9/0x3b0
       ret_from_fork+0x4d/0x80
       ret_from_fork_asm+0x1a/0x30
       </TASK>
      
      Allocated by task 1039:
       kasan_save_stack+0x33/0x60
       kasan_save_track+0x14/0x30
       __kasan_kmalloc+0x8f/0xa0
       __kmalloc+0x19c/0x360
       mlxsw_sp_acl_tcam_entry_create+0x7b/0x1f0
       mlxsw_sp_acl_tcam_vchunk_migrate_all+0x30d/0xb50
       mlxsw_sp_acl_tcam_vregion_rehash_work+0x157/0x1300
       process_one_work+0x8eb/0x19b0
       worker_thread+0x6c9/0xf70
       kthread+0x2c9/0x3b0
       ret_from_fork+0x4d/0x80
       ret_from_fork_asm+0x1a/0x30
      
      Freed by task 1039:
       kasan_save_stack+0x33/0x60
       kasan_save_track+0x14/0x30
       kasan_save_free_info+0x3b/0x60
       poison_slab_object+0x102/0x170
       __kasan_slab_free+0x14/0x30
       kfree+0xc1/0x290
       mlxsw_sp_acl_tcam_vchunk_migrate_all+0x3d7/0xb50
       mlxsw_sp_acl_tcam_vregion_rehash_work+0x157/0x1300
       process_one_work+0x8eb/0x19b0
       worker_thread+0x6c9/0xf70
       kthread+0x2c9/0x3b0
       ret_from_fork+0x4d/0x80
       ret_from_fork_asm+0x1a/0x30
      
      Fixes: 2bffc532 ("mlxsw: spectrum_acl: Don't take mutex in mlxsw_sp_acl_tcam_vregion_rehash_work()")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/1fcce0a60b231ebeb2515d91022284ba7b4ffe7a.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      79b5b4b1
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Fix race during rehash delayed work · d90cfe20
      Ido Schimmel authored
      The purpose of the rehash delayed work is to reduce the number of masks
      (eRPs) used by an ACL region as the eRP bank is a global and limited
      resource.
      
      This is done in three steps:
      
      1. Creating a new set of masks and a new ACL region which will use the
         new masks and to which the existing filters will be migrated. The
         new region is assigned to 'vregion->region' and the region from which
         the filters are migrated is assigned to 'vregion->region2'.
      
      2. Migrating all the filters from the old region to the new region.
      
      3. Destroying the old region and setting 'vregion->region2' to NULL.
      
      Only the second step is performed under the 'vregion->lock' mutex
      although its comment says that among other things it "Protects
      consistency of region, region2 pointers".
      
      This is problematic as the first step can race with filter insertion
      from user space that uses 'vregion->region', but under the mutex.
      
      Fix by holding the mutex across the entirety of the delayed work and not
      only during the second step.
      
      Fixes: 2bffc532 ("mlxsw: spectrum_acl: Don't take mutex in mlxsw_sp_acl_tcam_vregion_rehash_work()")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/1ec1d54edf2bad0a369e6b4fa030aba64e1f124b.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d90cfe20
    • Ido Schimmel's avatar
      mlxsw: spectrum_acl_tcam: Fix race in region ID allocation · 627f9c1b
      Ido Schimmel authored
      Region identifiers can be allocated both when user space tries to insert
      a new tc filter and when filters are migrated from one region to another
      as part of the rehash delayed work.
      
      There is no lock protecting the bitmap from which these identifiers are
      allocated, which is racy and leads to bad parameter errors from the
      device's firmware.
      
      Fix by converting the bitmap to IDA which handles its own locking. For
      consistency, do the same for the group identifiers that are part of the
      same structure.
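
The property that matters is an allocator that serializes itself internally, so that callers need no external lock. The sketch below shows that property with a different technique from the one the commit uses: a C11 atomic compare-and-swap over a single 64-bit word instead of the kernel's IDA. `id_alloc`, `id_free` and `MAX_IDS` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_IDS 64

/* One bit per identifier; a set bit means the identifier is in use. */
static _Atomic unsigned long long id_bitmap;

/* Returns the lowest free identifier, or -1 if all are taken. */
static int id_alloc(void)
{
    unsigned long long old = atomic_load(&id_bitmap);

    for (;;) {
        int id = -1;

        for (int i = 0; i < MAX_IDS; i++) {
            if (!(old & (1ULL << i))) {
                id = i;
                break;
            }
        }
        if (id < 0)
            return -1;
        /* Claim the bit only if nobody changed the bitmap in the meantime. */
        if (atomic_compare_exchange_weak(&id_bitmap, &old, old | (1ULL << id)))
            return id;
        /* On failure `old` holds the current value; search again. */
    }
}

/* Releases an identifier previously returned by id_alloc(). */
static void id_free(int id)
{
    assert(id >= 0 && id < MAX_IDS);
    atomic_fetch_and(&id_bitmap, ~(1ULL << id));
}
```

A plain find-first-zero followed by set-bit, which is what an unlocked bitmap amounts to, lets two concurrent callers pick the same identifier; the compare-and-swap closes that window.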
      
      Fixes: 2bffc532 ("mlxsw: spectrum_acl: Don't take mutex in mlxsw_sp_acl_tcam_vregion_rehash_work()")
      Reported-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Alexander Zubkov <green@qrator.net>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/ce494b7940cadfe84f3e18da7785b51ef5f776e3.1713797103.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      627f9c1b
    • Hyunwoo Kim's avatar
      net: openvswitch: Fix Use-After-Free in ovs_ct_exit · 5ea7b72d
      Hyunwoo Kim authored
      Since kfree_rcu, which is called in the hlist_for_each_entry_rcu traversal
      of ovs_ct_limit_exit, is not part of the RCU read critical section, it
      is possible that the RCU grace period will pass during the traversal and
      the key will be freed.
      
      To prevent this, it should be changed to hlist_for_each_entry_safe.
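
The "safe" iteration pattern reads the link to the next element before the current one can be released. A plain userspace analogue on a singly linked list, with hypothetical names (`struct node`, `push`, `free_all`) and immediate `free()` in place of deferred freeing:

```c
#include <stdlib.h>

/* Hypothetical list node, standing in for a per-zone limit entry. */
struct node {
    int zone;
    struct node *next;
};

/* Prepends a node; returns the new head (or the old head if allocation fails). */
static struct node *push(struct node *head, int zone)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return head;
    n->zone = zone;
    n->next = head;
    return n;
}

/* Frees every node and returns how many were freed. */
static int free_all(struct node **head)
{
    int freed = 0;
    struct node *cur = *head, *next;

    while (cur) {
        next = cur->next;   /* read the link before the node goes away */
        free(cur);
        cur = next;
        freed++;
    }
    *head = NULL;
    return freed;
}
```

Advancing with `cur = cur->next` after `free(cur)` would read freed memory, which is the same class of bug as the one described above.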
      
      Fixes: 11efd5cb ("openvswitch: Support conntrack zone limit")
      Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Aaron Conole <aconole@redhat.com>
      Link: https://lore.kernel.org/r/ZiYvzQN/Ry5oeFQW@v4bel-B760M-AORUS-ELITE-AX
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5ea7b72d
  2. 24 Apr, 2024 7 commits