1. 08 Dec, 2022 40 commits
    • ice: only check set bits in ice_ptp_flush_tx_tracker · e3ba5248
      Jacob Keller authored
      The ice_ptp_flush_tx_tracker function is called to clear all outstanding Tx
      timestamp requests when the port is being brought down. This function
      iterates over the entire list, but this is unnecessary. We only need to
      check the bits which are actually set in the ready bitmap.
      
      Replace this logic with for_each_set_bit, and follow a similar flow as in
      ice_ptp_tx_tstamp_cleanup. Note that it is safe to call dev_kfree_skb_any
      on a NULL pointer as it will perform a no-op so we do not need to verify
      that the skb is actually NULL.
      
      The new implementation also avoids clearing (and thus reading!) the PHY
      timestamp unless the index is marked as having a valid timestamp in the
      timestamp status bitmap. This ensures that we properly clear the status
      registers as appropriate.
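
The flush flow described above can be sketched in plain userspace C. This is a minimal illustration and not the driver code: `struct tx_tracker`, `flush_tracker`, the `in_use` field name, and the 64-slot layout are all hypothetical, `free()` stands in for dev_kfree_skb_any, and a loop over the lowest set bit plays the role of the kernel's for_each_set_bit.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_SLOTS 64

/* Hypothetical stand-in for the driver's Tx tracker; not the real layout. */
struct tx_tracker {
    uint64_t in_use;           /* bit i set: slot i has an outstanding request */
    void *payload[NUM_SLOTS];  /* stand-in for the per-slot skb pointer */
};

/*
 * Flush every outstanding slot, visiting only the set bits.
 * phy_ready models the timestamp status bitmap: a slot's timestamp is
 * "read" (to clear it) only when its bit is set there.
 * Returns the number of simulated PHY reads.
 */
static int flush_tracker(struct tx_tracker *tx, uint64_t phy_ready)
{
    int phy_reads = 0;

    while (tx->in_use) {
        int idx = 0;

        /* Find the lowest set bit; in_use is nonzero so this terminates. */
        while (!((tx->in_use >> idx) & 1))
            idx++;

        if ((phy_ready >> idx) & 1)
            phy_reads++;           /* only touch the PHY for valid slots */

        free(tx->payload[idx]);    /* free(NULL) is a no-op, no check needed */
        tx->payload[idx] = NULL;
        tx->in_use &= ~(UINT64_C(1) << idx);
    }

    return phy_reads;
}
```

Slots whose bit is clear are never visited, and a slot without a valid timestamp is released without any simulated PHY access.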
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: handle flushing stale Tx timestamps in ice_ptp_tx_tstamp · d40fd600
      Jacob Keller authored
      In the event of a PTP clock time change due to .adjtime or .settime, the
      ice driver needs to update the cached copy of the PHC time and also discard
      any outstanding Tx timestamps.
      
      This is required because otherwise the wrong copy of the PHC time will be
      used when extending the Tx timestamp. This could result in reporting
      incorrect timestamps to the stack.
      
      The current approach taken to handle this is to call
      ice_ptp_flush_tx_tracker, which will discard any timestamps which are not
      yet complete.
      
      This is problematic for two reasons:
      
      1) it could lead to a potential race condition where the wrong timestamp is
         associated with a future packet.
      
         This can occur with the following flow:
      
         1. Thread A gets request to transmit a timestamped packet, and picks an
            index and transmits the packet
      
         2. Thread B calls ice_ptp_flush_tx_tracker and sees the index in use,
            marking it as discarded. No timestamp read occurs because the status
            bit is not set, but the index is released for re-use
      
         3. Thread A gets a new request to transmit another timestamped packet,
            picks the same (now unused) index and transmits that packet.
      
         4. The PHY transmits the first packet and updates the timestamp slot and
            generates an interrupt.
      
         5. The ice_ptp_tx_tstamp thread executes and sees the interrupt and a
            valid timestamp but associates it with the new Tx SKB and not the one
            that the timestamp actually belongs to.
      
         This could result in the previous timestamp being assigned to a new
         packet producing incorrect timestamps and leading to incorrect behavior
         in PTP applications.
      
         This is most likely to occur when the packet rate for Tx timestamp
         requests is very high.
      
      2) on E822 hardware, we must avoid reading a timestamp index more than once
         each time its status bit is set and an interrupt is generated by
         hardware.
      
         We do have some extensive checks for the unread flag to ensure that only
         one of either the ice_ptp_flush_tx_tracker or ice_ptp_tx_tstamp threads
         reads the timestamp. However, even with this we can still have cases
         where we "flush" a timestamp that was actually completed in hardware.
         This can lead to cases where we don't read the timestamp index as
         appropriate.
      
      To fix both of these issues, we must avoid calling ice_ptp_flush_tx_tracker
      outside of the teardown path.
      
      Rather than using ice_ptp_flush_tx_tracker, introduce a new state bitmap,
      the stale bitmap. Start this as cleared when we begin a new timestamp
      request. When we're about to extend a timestamp and send it up to the
      stack, first check to see if that stale bit was set. If so, drop the
      timestamp without sending it to the stack.
      
      When we need to update the cached PHC timestamp out of band, just mark all
      currently outstanding timestamps as stale. This will ensure that once
      hardware completes the timestamp we'll ignore it correctly and avoid
      reporting bogus timestamps to userspace.
      
      With this change, we fix potential issues caused by calling
      ice_ptp_flush_tx_tracker during normal operation.
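
The stale-bitmap idea can be sketched as follows. This is a simplified userspace model under stated assumptions: `struct ts_tracker` and the `ts_*` helpers are hypothetical names, the bitmaps are single 64-bit words, and no locking is shown even though a real driver would need it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker state; field names are illustrative only. */
struct ts_tracker {
    uint64_t in_use;  /* bit set: a request is outstanding on that index */
    uint64_t stale;   /* bit set: the request predates a clock change */
};

/* Begin a new request on idx: mark it in use and start with stale cleared. */
static void ts_request(struct ts_tracker *t, unsigned int idx)
{
    t->in_use |= UINT64_C(1) << idx;
    t->stale &= ~(UINT64_C(1) << idx);
}

/* The clock was changed out of band: all outstanding requests become stale. */
static void ts_mark_stale(struct ts_tracker *t)
{
    t->stale |= t->in_use;
}

/*
 * Hardware completed idx. Release the index and report whether the
 * timestamp should be delivered (true) or dropped (false).
 */
static bool ts_complete(struct ts_tracker *t, unsigned int idx)
{
    uint64_t bit = UINT64_C(1) << idx;
    bool deliver = (t->in_use & bit) && !(t->stale & bit);

    t->in_use &= ~bit;
    return deliver;
}
```

The point of the design is that marking a request stale does not release its index. The index stays reserved until hardware completes it, so it cannot be handed to a new packet while an old timestamp is still pending.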
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: cleanup allocations in ice_ptp_alloc_tx_tracker · c1f3414d
      Jacob Keller authored
      The ice_ptp_alloc_tx_tracker function must allocate the timestamp array and
      the bitmap for tracking the currently in use indexes. A future change is
      going to add yet another allocation to this function.
      
      If any of these allocations fails, we need to clean up properly and
      ensure that the pointers in the ice_ptp_tx structure are NULL.
      
      Simplify this logic by allocating to local variables first. If any
      allocation fails, then free everything and exit. Only update the ice_ptp_tx
      structure if all allocations succeed.
      
      This ensures that we have no side effects on the Tx structure unless all
      allocations have succeeded. Thus, no code will see an invalid pointer and
      we don't need to re-assign NULL on cleanup.
      
      This is safe because kernel "free" functions are designed to be NULL safe
      and perform no action if passed a NULL pointer. Thus it's safe to simply
      always call kfree or bitmap_free even if one of those pointers was NULL.
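
A userspace sketch of the allocate-to-locals pattern, assuming `calloc`/`free` in place of the kernel allocators. `struct tracker`, its fields, and `tracker_alloc` are hypothetical names chosen for illustration.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tracker; the real structure has different fields. */
struct tracker {
    uint64_t *tstamps;
    unsigned long *in_use;
    unsigned long *stale;
};

/* Returns 0 on success, -1 on failure. On failure *t is left untouched. */
static int tracker_alloc(struct tracker *t, size_t len)
{
    /* Words needed to hold len bits; the spare word keeps the count nonzero. */
    size_t words = len / (sizeof(unsigned long) * CHAR_BIT) + 1;

    /* Allocate into locals first so the structure sees all or nothing. */
    uint64_t *tstamps = calloc(len, sizeof(*tstamps));
    unsigned long *in_use = calloc(words, sizeof(*in_use));
    unsigned long *stale = calloc(words, sizeof(*stale));

    if (!tstamps || !in_use || !stale) {
        /* free(NULL) is a no-op, so no per-pointer checks are needed. */
        free(tstamps);
        free(in_use);
        free(stale);
        return -1;
    }

    t->tstamps = tstamps;
    t->in_use = in_use;
    t->stale = stale;
    return 0;
}
```

Because the structure is only written after every allocation has succeeded, the failure path has a single cleanup block and never needs to reset fields to NULL.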
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: protect init and calibrating check in ice_ptp_request_ts · 3ad5c10b
      Jacob Keller authored
      When requesting a new timestamp, the ice_ptp_request_ts function does not
      hold the Tx tracker lock while checking init and calibrating. This means
      that we might issue a new timestamp request just after the Tx timestamp
      tracker starts being deinitialized. This could lead to incorrect access of
      the timestamp structures. Correct this by moving the init and calibrating
      checks under the lock, and updating the flows which modify these fields to
      use the lock.
      
      Note that we do not need to hold the lock while checking for tx->init in
      ice_ptp_tx_tstamp. This is because the teardown function will use
      synchronize_irq after clearing the flag to ensure that the threaded
      interrupt completes. Either a) the tx->init flag will be cleared before the
      ice_ptp_tx_tstamp function starts, thus it will exit immediately, or b) the
      threaded interrupt will be executing and the synchronize_irq will wait
      until the threaded interrupt has completed, at which point we know the init
      field has definitely been cleared and new interrupts will not execute the Tx
      timestamp thread function.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: synchronize the misc IRQ when tearing down Tx tracker · f0ae1240
      Jacob Keller authored
      Since commit 1229b339 ("ice: Add low latency Tx timestamp read") the
      ice driver has used a threaded IRQ for handling Tx timestamps. This change
      did not add a call to synchronize_irq during ice_ptp_release_tx_tracker.
      Thus it is possible that an interrupt could occur just as the tracker is
      being removed. This could lead to a use-after-free of the Tx tracker
      structure data.
      
      Fix this by calling synchronize_irq in ice_ptp_release_tx_tracker after
      we've cleared the init flag. In addition, make sure that we re-check the
      init flag at the end of ice_ptp_tx_tstamp before we exit ensuring that we
      will stop polling for new timestamps once the tracker de-initialization has
      begun.
      
      Refactor the ts_handled variable into "more_timestamps" so that we can
      simply directly assign this boolean instead of relying on an initialized
      value of true. This makes the new combined check easier to read.
      
      With this change, the ice_ptp_release_tx_tracker function will now wait for
      the threaded interrupt to complete if it was executing while the init flag
      was cleared.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: check Tx timestamp memory register for ready timestamps · 10e4b4a3
      Jacob Keller authored
      The PHY for E822 based hardware has a register which indicates which
      timestamps are valid in the PHY timestamp memory block. Each bit in the
      register indicates whether the associated index in the timestamp memory is
      valid.
      
      Hardware sets this bit when the timestamp is captured, and clears the bit
      when the timestamp is read. Use of this register is important as reading
      timestamp registers can impact the way that hardware generates timestamp
      interrupts.
      
      This occurs because the PHY has an internal value which is incremented
      when hardware captures a timestamp and decremented when software reads a
      timestamp. Reading timestamps which are not marked as valid still decrements
      the internal value and can result in the Tx timestamp interrupt not
      triggering in the future.
      
      To prevent this, use the timestamp memory value to determine which
      timestamps are ready to be read. The ice_get_phy_tx_tstamp_ready function
      reads this value. For E810 devices, this just always returns with all bits
      set.
      
      Skip any timestamp which is not set in this bitmap, avoiding reading extra
      timestamps on E822 devices.
      
      The stale check against a cached timestamp value is no longer necessary for
      PHYs which support the timestamp ready bitmap properly. E810 devices still
      need this. Introduce a new verify_cached flag to the ice_ptp_tx structure.
      Use this to determine if we need to perform the verification against the
      cached timestamp value. Set this to 1 for the E810 Tx tracker init
      function. Notice that many of the fields in ice_ptp_tx are simple 1 bit
      flags. Save some structure space by using bitfields of length 1 for these
      values.
      
      Modify the ICE_PTP_TS_VALID check to simply drop the timestamp immediately
      so that in an event of getting such an invalid timestamp the driver does
      not attempt to re-read the timestamp again in a future poll of the
      register.
      
      With these changes, the driver now reads each timestamp register exactly
      once, and does not attempt any re-reads. This ensures the interrupt
      tracking logic in the PHY will not get stuck.
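
The effect of gating reads on the ready bitmap can be shown with a toy model. Everything here is an assumption made for illustration: `struct toy_phy` and its helpers are hypothetical and only mimic the counter behaviour described in the text, not real E822 registers.

```c
#include <stdint.h>

/* Toy model of the PHY behaviour described above; purely illustrative. */
struct toy_phy {
    uint64_t valid;  /* bit set: timestamp memory at that index is valid */
    int pending;     /* internal count: +1 per capture, -1 per read */
};

/* Hardware captures a timestamp into idx. */
static void phy_capture(struct toy_phy *p, unsigned int idx)
{
    p->valid |= UINT64_C(1) << idx;
    p->pending++;
}

/* Any read decrements the count, even if the index was not valid. */
static void phy_read(struct toy_phy *p, unsigned int idx)
{
    p->valid &= ~(UINT64_C(1) << idx);
    p->pending--;
}

/*
 * Read only indexes that are both outstanding and marked valid.
 * Returns the number of reads performed.
 */
static int poll_ready(struct toy_phy *p, uint64_t in_use)
{
    uint64_t todo = in_use & p->valid;
    int reads = 0;

    for (unsigned int idx = 0; idx < 64; idx++) {
        if (todo & (UINT64_C(1) << idx)) {
            phy_read(p, idx);
            reads++;
        }
    }

    return reads;
}
```

In this model, reading an index that was never captured would push `pending` below zero, which is the kind of miscount the text says can stop future interrupts. Masking with `valid` first keeps the count balanced.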
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: handle discarding old Tx requests in ice_ptp_tx_tstamp · 0dd92862
      Jacob Keller authored
      Currently the driver uses the PTP kthread to process handling and
      discarding of stale Tx timestamp requests. The function
      ice_ptp_tx_tstamp_cleanup is used for this.
      
      A separate thread creates complications for the driver as we now have both
      the main Tx timestamp processing IRQ checking timestamps as well as the
      kthread.
      
      Rather than using the kthread to handle this, simply check for stale
      timestamps within the ice_ptp_tx_tstamp function. This function must
      already process the timestamps anyways.
      
      If a Tx timestamp has been waiting for 2 seconds we simply clear the bit
      and discard the SKB. This avoids the complication of having separate
      threads polling, reducing overall CPU work.
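
A rough sketch of the timeout check folded into the normal processing pass. It assumes millisecond timestamps instead of kernel jiffies, and `struct ts_slot`, `discard_old`, and `TS_TIMEOUT_MS` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define TS_TIMEOUT_MS 2000  /* the 2 second limit mentioned in the text */

/* Hypothetical per-index request state. */
struct ts_slot {
    bool in_use;
    uint64_t start_ms;  /* when the request was issued */
};

/*
 * Walk the slots as the normal processing loop would, discarding any
 * request that has waited longer than the timeout.
 * Returns the number of requests discarded.
 */
static int discard_old(struct ts_slot *slots, int n, uint64_t now_ms)
{
    int dropped = 0;

    for (int i = 0; i < n; i++) {
        if (slots[i].in_use && now_ms - slots[i].start_ms > TS_TIMEOUT_MS) {
            slots[i].in_use = false;  /* release the index */
            dropped++;
        }
    }

    return dropped;
}
```

Since the check runs in the same pass that already visits each outstanding request, no second thread has to poll the tracker.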
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: always call ice_ptp_link_change and make it void · 6b1ff5d3
      Jacob Keller authored
      The ice_ptp_link_change function is currently only called for E822 based
      hardware. Future changes are going to extend this function to perform
      additional tasks on link change.
      
      Always call this function, moving the E810 check from the callers down to
      just before we call the E822-specific function required to restart the PHY.
      
      This function also returns an error value, but none of the callers actually
      check it. In general, the errors it produces are more likely systemic
      problems such as invalid or corrupt port numbers. No caller checks these,
      and so no warning is logged.
      
      Re-order the flag checks so that ICE_FLAG_PTP is checked first. Drop the
      unnecessary check for ICE_FLAG_PTP_SUPPORTED, as ICE_FLAG_PTP will not be
      set except when ICE_FLAG_PTP_SUPPORTED is set.
      
      Convert the port checks to WARN_ON_ONCE, in order to generate a kernel
      stack trace when they are hit.
      
      Convert the function to void since no caller actually checks these return
      values.
      Co-developed-by: Dave Ertman <david.m.ertman@intel.com>
      Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: fix misuse of "link err" with "link status" · 11722c39
      Jacob Keller authored
      The ice_ptp_link_change function has a comment which mentions "link
      err" when referring to the current link status. We are storing the status
      of whether link is up or down, which is not an error.
      
      It appears that this use of err accidentally got included due to an
      overzealous search and replace when removing the ice_status enum and local
      status variable.
      
      Fix the wording to use the correct term.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Reset TS memory for all quads · 407b66c0
      Karol Kolacinski authored
      In E822 products, the owner PF should reset timestamp memory for all quads,
      not only for the quad its assigned lport belongs to.
      Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Remove the E822 vernier "bypass" logic · 0357d5ca
      Milena Olech authored
      The E822 devices support an extended "vernier" calibration which enables
      higher precision timestamps by accounting for delays in the PHY, and
      compensating for them. These delays are measured by hardware as part of its
      vernier calibration logic.
      
      The driver currently starts the PHY in "bypass" mode which skips
      the compensation. Then it later attempts to switch from bypass to vernier.
      This unfortunately does not work as expected. Instead of properly
      compensating for the delays, the hardware continues operating in bypass
      without the improved precision expected.
      
      Because we cannot dynamically switch between bypass and vernier mode,
      refactor the driver to always operate in vernier mode. This has a slight
      downside: Tx timestamp and Rx timestamp requests that occur as the very
      first packet set after link up will not complete properly and may be
      reported to applications as missing timestamps.
      
      This occurs frequently in test environments where traffic is light or
      targeted specifically at testing PTP. However, in practice most
      environments will have transmitted or received some data over the network
      before such initial requests are made.
      Signed-off-by: Milena Olech <milena.olech@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: Use more generic names for ice_ptp_tx fields · 6b5cbc8c
      Sergey Temerkhanov authored
      Some supported devices have per-port timestamp memory blocks while
      others have shared ones within quads. Rename the struct ice_ptp_tx
      fields to reflect the block entities it works with.
      Signed-off-by: Sergey Temerkhanov <sergey.temerkhanov@intel.com>
      Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Merge branch 'net-ethernet-ti-am65-cpsw-fix-set-channel-operation' · d8b879c0
      Jakub Kicinski authored
      Roger Quadros says:
      
      ====================
      net: ethernet: ti: am65-cpsw: Fix set channel operation
      
      This contains a critical bug fix for the recently merged suspend/resume
      support [1] that broke set channel operation. (ethtool -L eth0 tx <n>)
      
      As there were 2 dependent patches on top of the offending commit [1]
      first revert them and then apply them back after the correct fix.
      
      [1] fd23df72 ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")
      ====================
      
      Link: https://lore.kernel.org/r/20221206094419.19478-1-rogerq@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethernet: ti: am65-cpsw: Fix hardware switch mode on suspend/resume · 020b232f
      Roger Quadros authored
      When entering low power during system suspend, the ALE table context is lost.
      Save the ALE context before suspend and restore it after resume.
      Signed-off-by: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethernet: ti: am65-cpsw: retain PORT_VLAN_REG after suspend/resume · 1581cd8b
      Roger Quadros authored
      During suspend/resume the context of PORT_VLAN_REG is lost, so
      save it during suspend and restore it during resume for the
      host port and slave ports.
      Signed-off-by: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethernet: ti: am65-cpsw: Add suspend/resume support · 24bc19b0
      Roger Quadros authored
      Add PM handlers for System suspend/resume.
      
      As DMA driver doesn't yet support suspend/resume we free up
      the DMA channels at suspend and acquire and initialize them
      at resume.
      
      In this revised approach we do not free the TX/RX IRQs at
      am65_cpsw_nuss_common_stop() as it causes problems.
      We will now free them only on .suspend() as we need to release
      the DMA channels (as DMA loses context) and re-acquiring
      them on .resume() may not necessarily give us the same
      IRQs.
      
      To make this easier:
      - introduce am65_cpsw_nuss_remove_rx_chns() which is
         similar to am65_cpsw_nuss_remove_tx_chns(). These will
         be invoked in pm.suspend() to release the DMA channels
         and free up the IRQs.
      - move napi_add() and request_irq() calls to
         am65_cpsw_nuss_init_rx/tx_chns() so we can invoke them
         in pm.resume() to acquire the DMA channels and IRQs.
      
      As CPTS loses context during suspend/resume, invoke the
      necessary CPTS suspend/resume helpers.
      
      ALE_CLEAR command is issued in cpsw_ale_start() so no need
      to issue it before the call to cpsw_ale_start().
      Signed-off-by: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Revert "net: ethernet: ti: am65-cpsw: Add suspend/resume support" · 1a014663
      Roger Quadros authored
      This reverts commit fd23df72.
      
      This commit broke set channel operation. Revert this and
      implement it with a different approach in a separate patch.
      Signed-off-by: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Revert "net: ethernet: ti: am65-cpsw: retain PORT_VLAN_REG after suspend/resume" · 1bae8fa8
      Roger Quadros authored
      This reverts commit 643cf0e3.
      
      This is to make it easier to revert the offending commit
      fd23df72 ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")
      Signed-off-by: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Revert "net: ethernet: ti: am65-cpsw: Fix hardware switch mode on suspend/resume" · 1a352596
      Roger Quadros authored
      This reverts commit 1af3cb37.
      
      This is to make it easier to revert the offending commit
      fd23df72 ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")
      Signed-off-by: Roger Quadros <rogerq@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'devlink-add-port-function-attribute-to-enable-disable-roce-and-migratable' · e1228581
      Jakub Kicinski authored
      Shay Drory says:
      
      ====================
      devlink: Add port function attribute to enable/disable Roce and migratable
      
      This series is a complete rewrite of the series "devlink: Add port
      function attribute to enable/disable roce"
      link:
      https://lore.kernel.org/netdev/20221102163954.279266-1-danielj@nvidia.com/
      
      Currently mlx5 PCI VF and SF are enabled by default for RoCE
      functionality. And mlx5 PCI VF is disabled by default for migratable
      functionality.
      
      Currently a user does not have the ability to disable RoCE for a PCI
      VF/SF device before such device is enumerated by the driver.
      
      In a smartnic scenario, a user is also unable to apply this setting to a
      VF from the smartnic.
      
      Current 'enable_roce' device knob is limited to do setting only at
      driverinit time. By this time device is already created and firmware has
      already allocated necessary system memory for supporting RoCE.
      
      Also, currently a user does not have the ability to enable migratable
      for a PCI VF.
      
      The above are hypervisor-level controls, used to set the functionality of
      devices passed through to guests.
      
      This is achieved by extending existing 'port function' object to control
      capabilities of a function. This enables users to control capability of
      the device before enumeration.
      
      Example of a user disabling RoCE for a VF when using switchdev
      mode:
      
      $ devlink port show pci/0000:06:00.0/1
      pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0
      pfnum 0 vfnum 0 external false splittable false
        function:
          hw_addr 00:00:00:00:00:00 roce enable
      
      $ devlink port function set pci/0000:06:00.0/1 roce disable
      
      $ devlink port show pci/0000:06:00.0/1
      pci/0000:06:00.0/1: type eth netdev pf0vf0 flavour pcivf controller 0
      pfnum 0 vfnum 0 external false splittable false
        function:
          hw_addr 00:00:00:00:00:00 roce disable
      
      FAQs:
      -----
      1. What does roce enable/disable do?
      Ans: It disables RoCE capability of the function before it is enumerated,
      so when driver reads the capability from the device firmware, it is
      disabled.
      At this point RDMA stack will not be able to create UD, QP1, RC, XRC
      type of QPs. When RoCE is disabled, the GID table of all ports of the
      device is disabled in the device and software stack.
      
      2. How is the roce 'port function' option different from existing
      devlink param?
      Ans: RoCE attribute at the port function level disables the RoCE
      capability at the specific function level; while enable_roce does so only
      at the software level.
      
      3. Why is this option for disabling only RoCE and not the whole RDMA
      device?
      Ans: Because the user still wants to use the RDMA device for non-RoCE
      commands in a more memory-efficient way.
      ====================
      
      Link: https://lore.kernel.org/r/20221206185119.380138-1-shayd@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: E-Switch, Implement devlink port function cmds to control migratable · e5b9642a
      Shay Drory authored
      Implement devlink port function commands to enable / disable migratable.
      This is used to control the migratable capability of the device.
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Acked-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Expose port function commands to control migratable · a8ce7b26
      Shay Drory authored
      Expose port function commands to enable / disable migratable
      capability, this is used to set the port function as migratable.
      
      Live migration is the process of transferring a live virtual machine
      from one physical host to another without disrupting its normal
      operation.
      
      In order for a VM to be able to perform LM, all the VM components must
      be able to perform migration, i.e. be migratable.
      In order for VF to be migratable, VF must be bound to VFIO driver with
      migration support.
      
      When migratable capability is enabled for a function of the port, the
      device is making the necessary preparations for the function to be
      migratable, which might include disabling features which cannot be
      migrated.
      
      Example of LM with migratable function configuration:
      Set migratable of the VF's port function.
      
      $ devlink port show pci/0000:06:00.0/2
      pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
      vfnum 1
          function:
              hw_addr 00:00:00:00:00:00 migratable disable
      
      $ devlink port function set pci/0000:06:00.0/2 migratable enable
      
      $ devlink port show pci/0000:06:00.0/2
      pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
      vfnum 1
          function:
              hw_addr 00:00:00:00:00:00 migratable enable
      
      Bind VF to VFIO driver with migration support:
      $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
      $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
      $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
      
      Attach VF to the VM.
      Start the VM.
      Perform LM.
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: E-Switch, Implement devlink port function cmds to control RoCE · 7db98396
      Yishai Hadas authored
      Implement devlink port function commands to enable / disable RoCE.
      This is used to control the RoCE device capabilities.
      
      This patch implements infrastructure which will be used by downstream
      patches that will add additional capabilities.
      Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
      Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Acked-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Add generic getters for other functions caps · 47d0c500
      Shay Drory authored
      A downstream patch needs to get another function's GENERAL2 caps, while
      mlx5_vport_get_other_func_cap() gets only one type of caps (general).
      Rename it to represent this and introduce a generic implementation
      of mlx5_vport_get_other_func_cap().
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Acked-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: Expose port function commands to control RoCE · da65e9ff
      Shay Drory authored
      Expose port function commands to enable / disable RoCE, this is used to
      control the port RoCE device capabilities.
      
      When RoCE is disabled for a function of the port, the function cannot create
      any RoCE-specific resources (e.g. GID table).
      It also reduces system memory utilization. For example, disabling RoCE for a
      VF/SF saves 1 Mbyte of system memory per function.
      
      Example of a PCI VF port which supports function configuration:
      Set RoCE of the VF's port function.
      
      $ devlink port show pci/0000:06:00.0/2
      pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
      vfnum 1
          function:
              hw_addr 00:00:00:00:00:00 roce enable
      
      $ devlink port function set pci/0000:06:00.0/2 roce disable
      
      $ devlink port show pci/0000:06:00.0/2
      pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
      vfnum 1
          function:
              hw_addr 00:00:00:00:00:00 roce disable
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      da65e9ff
    • Shay Drory's avatar
      devlink: Move devlink port function hw_addr attr documentation · 875cd5ee
      Shay Drory authored
      The devlink port function hw_addr attr documentation is in an
      mlx5-specific file, although there is nothing mlx5-specific about it.
      Move it to devlink-port.rst.
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      875cd5ee
    • Shay Drory's avatar
      devlink: Validate port function request · c0bea69d
      Shay Drory authored
      In order to avoid partial request processing, validate the request
      before processing it.
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c0bea69d
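The validate-then-process idea from the commit above can be illustrated with a minimal userspace sketch. It is not the devlink code: `struct fn_request`, `struct fn_ops`, `struct fn_state` and both functions are hypothetical names invented for this illustration. The point shown is only that every attribute in a request is checked before any of them is applied, so a rejected request leaves the state untouched.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: each attribute is optional. */
struct fn_request {
	bool has_hw_addr;
	size_t hw_addr_len;
	bool has_roce;
	int roce;		/* 0 = disable, 1 = enable */
};

/* Hypothetical description of what the driver supports. */
struct fn_ops {
	bool can_set_hw_addr;
	bool can_set_roce;
};

/* Hypothetical current state of the port function. */
struct fn_state {
	size_t hw_addr_len;
	int roce;
};

/* Check the whole request up front; nothing is modified here. */
static int fn_request_validate(const struct fn_ops *ops,
			       const struct fn_request *req)
{
	if (req->has_hw_addr && !ops->can_set_hw_addr)
		return -EOPNOTSUPP;
	if (req->has_roce && !ops->can_set_roce)
		return -EOPNOTSUPP;
	return 0;
}

/* Apply the request only after every attribute passed validation. */
static int fn_request_apply(const struct fn_ops *ops,
			    const struct fn_request *req,
			    struct fn_state *state)
{
	int err = fn_request_validate(ops, req);

	if (err)
		return err;	/* rejected as a whole, state untouched */

	if (req->has_hw_addr)
		state->hw_addr_len = req->hw_addr_len;
	if (req->has_roce)
		state->roce = req->roce;
	return 0;
}
```

Without the up-front check, a request that sets both attributes on a driver supporting only the first would change hw_addr and then fail on roce, which is the partial processing the commit aims to avoid.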
    • Yishai Hadas's avatar
      net/mlx5: Introduce IFC bits for migratable · df268f6c
      Yishai Hadas authored
      Introduce IFC-related capabilities that allow a VF to be set as able to
      perform live migration, i.e., to be migratable.
      Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Acked-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      df268f6c
    • Jakub Kicinski's avatar
      Merge branch 'bridge-mcast-preparations-for-evpn-extensions' · 5955a948
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      bridge: mcast: Preparations for EVPN extensions
      
      This patchset was split from [1] and includes non-functional changes
      aimed at making it easier to add additional netlink attributes later on.
      Future extensions are available here [2].
      
      The idea behind these patches is to create an MDB configuration
      structure into which netlink messages are parsed. The structure is
      then passed in the entry creation / deletion call chain instead of
      passing the netlink attributes themselves. The same pattern is used by
      other rtnetlink objects such as routes and nexthops.
      
      I initially tried to extend the current code, but it proved to be too
      difficult, which is why I decided to refactor it to the extensible and
      familiar pattern used by other rtnetlink objects.
      
      Tested using existing selftests and using a new selftest that will be
      submitted together with the planned extensions.
      
      [1] https://lore.kernel.org/netdev/20221018120420.561846-1-idosch@nvidia.com/
      [2] https://github.com/idosch/linux/commits/submit/mdb_v1
      ====================
      
      Link: https://lore.kernel.org/r/20221206105809.363767-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5955a948
    • Ido Schimmel's avatar
      bridge: mcast: Constify 'group' argument in br_multicast_new_port_group() · f86c3e2c
      Ido Schimmel authored
      The 'group' argument is not modified, so mark it as 'const'. It will
      allow us to constify arguments of the callers of this function in future
      patches.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f86c3e2c
    • Ido Schimmel's avatar
      bridge: mcast: Remove redundant function arguments · 090149ea
      Ido Schimmel authored
      Drop the first three arguments and instead extract them from the MDB
      configuration structure.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      090149ea
    • Ido Schimmel's avatar
      bridge: mcast: Move checks out of critical section · 4c1ebc6c
      Ido Schimmel authored
      The checks only require information parsed from the RTM_NEWMDB netlink
      message and do not rely on any state stored in the bridge driver.
      Therefore, there is no need to perform the checks in the critical
      section under the multicast lock.
      
      Move the checks out of the critical section.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4c1ebc6c
    • Ido Schimmel's avatar
      bridge: mcast: Remove br_mdb_parse() · 3ee56623
      Ido Schimmel authored
      The parsing of the netlink messages and the validity checks are now
      performed in br_mdb_config_init() so we can remove br_mdb_parse().
      
      This finally allows us to stop passing netlink attributes deep in the
      MDB control path and only use the MDB configuration structure.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3ee56623
    • Ido Schimmel's avatar
      bridge: mcast: Use MDB group key from configuration structure · 9f52a514
      Ido Schimmel authored
      The MDB group key (i.e., {source, destination, protocol, VID}) is
      currently determined under the multicast lock from the netlink
      attributes. Instead, use the group key from the MDB configuration
      structure that was prepared before acquiring the lock.
      
      No functional changes intended.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      9f52a514
    • Ido Schimmel's avatar
      bridge: mcast: Propagate MDB configuration structure further · 8bd9c08e
      Ido Schimmel authored
      As an intermediate step towards only using the new MDB configuration
      structure, pass it further in the control path instead of passing
      individual attributes.
      
      No functional changes intended.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8bd9c08e
    • Ido Schimmel's avatar
      bridge: mcast: Use MDB configuration structure where possible · f2b5aac6
      Ido Schimmel authored
      The MDB configuration structure (i.e., struct br_mdb_config) now
      includes all the necessary information from the parsed RTM_{NEW,DEL}MDB
      netlink messages, so use it.
      
      This will later allow us to delete the calls to br_mdb_parse() from
      br_mdb_add() and br_mdb_del().
      
      No functional changes intended.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f2b5aac6
    • Ido Schimmel's avatar
      bridge: mcast: Remove redundant checks · 38661168
      Ido Schimmel authored
      These checks are now redundant as they are performed by
      br_mdb_config_init() while parsing the RTM_{NEW,DEL}MDB messages.
      
      Remove them.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      38661168
    • Ido Schimmel's avatar
      bridge: mcast: Centralize netlink attribute parsing · cb453926
      Ido Schimmel authored
      Netlink attributes are currently passed deep in the MDB creation call
      chain, making it difficult to add new attributes. In addition, some
      validity checks are performed under the multicast lock although they can
      be performed before it is ever acquired.
      
      As a first step towards solving these issues, parse the RTM_{NEW,DEL}MDB
      messages into a configuration structure, relieving other functions from
      the need to handle raw netlink attributes.
      
      Subsequent patches will convert the MDB code to use this configuration
      structure.
      
      This is consistent with how other rtnetlink objects are handled, such as
      routes and nexthops.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cb453926
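The pattern described in the commit above (parse and validate the message into a configuration structure first, then take the lock and work only from that structure) can be sketched in userspace C. This is a simplified illustration under stated assumptions, not the bridge code: `struct attr`, `struct mdb_config_sketch`, `struct bridge_sketch` and both functions are hypothetical, the group key is reduced to an IPv4 group address plus VID, and the multicast lock is modelled by a counter.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for raw netlink attributes. */
enum attr_type { ATTR_IFINDEX, ATTR_VID, ATTR_GROUP };

struct attr {
	enum attr_type type;
	uint32_t value;
};

/* Hypothetical configuration structure filled in by the parser. */
struct mdb_config_sketch {
	uint32_t ifindex;
	uint16_t vid;
	uint32_t group;		/* IPv4 group address, host byte order */
};

#define MDB_MAX 8

struct bridge_sketch {
	int lock_taken;		/* counts critical sections entered */
	size_t n_entries;
	struct mdb_config_sketch entries[MDB_MAX];
};

/* Parse and validate every attribute; needs no bridge state, so no lock. */
static int mdb_config_sketch_init(struct mdb_config_sketch *cfg,
				  const struct attr *attrs, size_t n)
{
	bool have_ifindex = false, have_group = false;
	size_t i;

	*cfg = (struct mdb_config_sketch){0};
	for (i = 0; i < n; i++) {
		switch (attrs[i].type) {
		case ATTR_IFINDEX:
			if (attrs[i].value == 0)
				return -EINVAL;
			cfg->ifindex = attrs[i].value;
			have_ifindex = true;
			break;
		case ATTR_VID:
			if (attrs[i].value >= 4095)	/* 4095 is reserved */
				return -EINVAL;
			cfg->vid = (uint16_t)attrs[i].value;
			break;
		case ATTR_GROUP:
			/* Must be IPv4 multicast, i.e. inside 224.0.0.0/4. */
			if ((attrs[i].value >> 28) != 0xE)
				return -EINVAL;
			cfg->group = attrs[i].value;
			have_group = true;
			break;
		default:
			return -EINVAL;
		}
	}
	return (have_ifindex && have_group) ? 0 : -EINVAL;
}

/* Only the state-dependent work runs inside the critical section. */
static int mdb_add_sketch(struct bridge_sketch *br,
			  const struct attr *attrs, size_t n)
{
	struct mdb_config_sketch cfg;
	int err = mdb_config_sketch_init(&cfg, attrs, n);

	if (err)
		return err;	/* rejected before the lock is taken */

	br->lock_taken++;	/* the multicast lock would be taken here */
	if (br->n_entries == MDB_MAX)
		err = -ENOSPC;
	else
		br->entries[br->n_entries++] = cfg;
	/* the multicast lock would be released here */
	return err;
}
```

Adding a new attribute in this layout means touching only the parser and the structure; the functions called under the lock keep the same signature.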
    • ye xingchen's avatar
      net: ethernet: use sysfs_emit() to instead of scnprintf() · 16dc16d9
      ye xingchen authored
      Follow the advice in Documentation/filesystems/sysfs.rst: show()
      should only use sysfs_emit() or sysfs_emit_at() when formatting the
      value to be returned to user space.
      Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/202212051918564721658@zte.com.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      16dc16d9