1. 21 Nov, 2023 4 commits
    • net: axienet: Preparatory changes for dmaengine support · 6b1b40f7
      Sarath Babu Naidu Gaddam authored
      The axiethernet driver has inbuilt DMA programming. To add dmaengine
      support and make its integration seamless, the current inbuilt axidma
      programming code is put under a use_dmaengine check.
      
      This patch also reorders code slightly to minimize conditional
      use_dmaengine checks; there is no functional change. The "dmas"
      property identifies whether the driver should use the dmaengine
      framework or the inbuilt axidma programming.
      Signed-off-by: Sarath Babu Naidu Gaddam <sarath.babu.naidu.gaddam@amd.com>
      Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
      Link: https://lore.kernel.org/r/1700074613-1977070-3-git-send-email-radhey.shyam.pandey@amd.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6b1b40f7
    • dt-bindings: net: xlnx,axi-ethernet: Introduce DMA support · 5e63c5ef
      Radhey Shyam Pandey authored
      Xilinx 1G/2.5G Ethernet Subsystem provides 32-bit AXI4-Stream buses to
      move transmit and receive Ethernet data to and from the subsystem.
      
      These buses are designed to be used with an AXI Direct Memory Access (DMA)
      IP or AXI Multichannel Direct Memory Access (MCDMA) IP core, AXI4-Stream
      Data FIFO, or any other custom logic in any supported device.
      
      Primary high-speed DMA data movement between system memory and stream
      target is through the AXI4 Read Master to AXI4 memory-mapped to stream
      (MM2S) Master, and AXI stream to memory-mapped (S2MM) Slave to AXI4
      Write Master. AXI DMA/MCDMA enables channels of data movement on both
      MM2S and S2MM paths in scatter/gather mode.
      
      AXI DMA has two channels, whereas MCDMA has 16 Tx and 16 Rx channels.
      To uniquely identify each channel, use the 'chan' suffix. Depending on
      the use case, the AXI Ethernet driver can request any combination of
      multichannel DMA channels using the generic dmas and dma-names properties.
      
      Example:
      dma-names = tx_chan0, rx_chan0, tx_chan1, rx_chan1;
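
The naming scheme in the example can be sketched as a small generator. This is only an illustration: the helper name `dma_names` is hypothetical, and the interleaved tx/rx ordering simply mirrors the example above; the binding itself allows any combination of channels.

```python
def dma_names(num_channel_pairs):
    """Build tx_chanN/rx_chanN names for the given number of channel pairs.

    Hypothetical helper: the ordering copies the example in the commit
    message and is not mandated by the binding.
    """
    names = []
    for index in range(num_channel_pairs):
        names.append(f"tx_chan{index}")
        names.append(f"rx_chan{index}")
    return names


example = ", ".join(dma_names(2))
print(example)
```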
      Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
      Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Link: https://lore.kernel.org/r/1700074613-1977070-2-git-send-email-radhey.shyam.pandey@amd.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5e63c5ef
    • selftests: net: verify fq per-band packet limit · a0bc96c0
      Willem de Bruijn authored
      Commit 29f834aa ("net_sched: sch_fq: add 3 bands and WRR
      scheduling") introduces multiple traffic bands, and per-band maximum
      packet count.
      
      Per-band limits ensure that packets in one class cannot fill the
      entire qdisc and so cause DoS to the traffic in the other classes.
      
      Verify this behavior:
        1. set the limit to 10 per band
        2. send 20 pkts on band A: verify that 10 are queued, 10 dropped
        3. send 20 pkts on band A: verify that  0 are queued, 20 dropped
        4. send 20 pkts on band B: verify that 10 are queued, 10 dropped
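
The expected counts in the steps above follow from simple per-band bookkeeping. The sketch below is a toy model, not the sch_fq implementation or the selftest itself; the class name `BandLimitedQueue` is hypothetical, and it assumes nothing is dequeued between sends (the role SO_TXTIME plays in the real test).

```python
from collections import defaultdict


class BandLimitedQueue:
    """Toy model of a qdisc with an independent packet limit per band."""

    def __init__(self, per_band_limit):
        self.per_band_limit = per_band_limit
        self.queued = defaultdict(int)  # band name -> packets currently held

    def send(self, band, count):
        """Offer `count` packets to `band`; return (queued, dropped)."""
        room = self.per_band_limit - self.queued[band]
        accepted = max(0, min(count, room))
        self.queued[band] += accepted
        return accepted, count - accepted


qdisc = BandLimitedQueue(per_band_limit=10)
results = [qdisc.send("A", 20), qdisc.send("A", 20), qdisc.send("B", 20)]
print(results)
```

Band B still accepts 10 packets after band A is full, which is the isolation property the test verifies.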
      
      Packets must remain queued for a period to trigger this behavior.
      Use SO_TXTIME to store packets for 100 msec.
      
      The test reuses existing upstream test infra. The script is a fork of
      cmsg_time.sh. The scripts call cmsg_sender.
      
      The test extends cmsg_sender with two arguments:
      
      * '-P' SO_PRIORITY
        There is a subtle difference between IPv4 and IPv6 stack behavior:
        PF_INET/IP_TOS        sets IP header bits and sk_priority
        PF_INET6/IPV6_TCLASS  sets IP header bits BUT NOT sk_priority
      
      * '-n' num pkts
        Send multiple packets in quick succession.
        I first attempted a for loop in the script, but this is too slow in
        virtualized environments, causing flakiness as the 100ms timeout is
        reached and packets are dequeued.
      
      Also do not wait for timestamps to be queued unless timestamps are
      requested.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20231116203449.2627525-1-willemdebruijn.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a0bc96c0
    • net: microchip: lan743x : bidirectional throughput improvement · 45933b2d
      Vishvambar Panth S authored
      The LAN743x/PCI11xxx DMA descriptors are always 4 dwords long, but the
      device supports placing the descriptors in memory back to back or
      reserving space in between them using its DMA_DESCRIPTOR_SPACE (DSPACE)
      configurable hardware setting. Currently DSPACE is unnecessarily set to
      match the host's L1 cache line size, which reserves space between
      descriptors on most platforms and causes suboptimal behavior (a single
      PCIe Mem transaction per descriptor). Changing the setting to DSPACE=16
      packs many descriptors into a single PCIe Mem transaction, resulting in
      a massive performance improvement in bidirectional tests without any
      negative effects.
      Tested and verified improvements on x64 PC and several ARM platforms
      (typical data below)
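
The packing effect can be illustrated with a little arithmetic. This is a rough sketch under stated assumptions: the 64-byte cache line and the 64-byte contiguous read window are illustrative values chosen here, not figures from the commit or the device documentation. Only the 16-byte (4 dword) descriptor size comes from the text above.

```python
DESCRIPTOR_BYTES = 16  # 4 dwords, per the commit message


def descriptors_per_window(dspace_bytes, window_bytes):
    """Count descriptor slots (one per DSPACE stride) inside one read window."""
    if dspace_bytes < DESCRIPTOR_BYTES:
        raise ValueError("DSPACE cannot be smaller than one descriptor")
    return window_bytes // dspace_bytes


ASSUMED_CACHE_LINE = 64   # assumption: a common L1 cache line size
ASSUMED_READ_WINDOW = 64  # assumption: bytes fetched by one contiguous read

spaced = descriptors_per_window(ASSUMED_CACHE_LINE, ASSUMED_READ_WINDOW)
packed = descriptors_per_window(DESCRIPTOR_BYTES, ASSUMED_READ_WINDOW)
print(spaced, packed)
```

Under these assumptions a single read covers one descriptor when DSPACE matches the cache line and four when DSPACE=16.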
      
      Test setup 1: x64 PC with LAN7430 ---> x64 PC
      
      iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID][Role] Interval           Transfer     Bitrate
      [  5][TX-C]   0.00-10.00  sec   170 MBytes   143 Mbits/sec  sender
      [  5][TX-C]   0.00-10.04  sec   169 MBytes   141 Mbits/sec  receiver
      [  7][RX-C]   0.00-10.00  sec  1.02 GBytes   876 Mbits/sec  sender
      [  7][RX-C]   0.00-10.04  sec  1.02 GBytes   870 Mbits/sec  receiver
      
      iperf3 UDP bidirectional with DSPACE set to 16 Bytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID][Role] Interval           Transfer     Bitrate
      [  5][TX-C]   0.00-10.00  sec  1.11 GBytes   956 Mbits/sec  sender
      [  5][TX-C]   0.00-10.04  sec  1.11 GBytes   951 Mbits/sec  receiver
      [  7][RX-C]   0.00-10.00  sec  1.10 GBytes   948 Mbits/sec  sender
      [  7][RX-C]   0.00-10.04  sec  1.10 GBytes   942 Mbits/sec  receiver
      
      Test setup 2 : RK3399 with LAN7430 ---> x64 PC
      
      RK3399 Spec:
      The SOM-RK3399 is an ARM module designed and developed by FriendlyElec.
      Cores: 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53
      Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz)
      PCIe: PCIe x4, compatible with PCIe 2.1, Dual operation mode
      
      iperf3 UDP bidirectional with DSPACE set to L1 CACHE Size:
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID][Role] Interval           Transfer     Bitrate
      [  5][TX-C]   0.00-10.00  sec   534 MBytes   448 Mbits/sec  sender
      [  5][TX-C]   0.00-10.05  sec   534 MBytes   446 Mbits/sec  receiver
      [  7][RX-C]   0.00-10.00  sec  1.12 GBytes   961 Mbits/sec  sender
      [  7][RX-C]   0.00-10.05  sec  1.11 GBytes   946 Mbits/sec  receiver
      
      iperf3 UDP bidirectional with DSPACE set to 16 Bytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID][Role] Interval           Transfer     Bitrate
      [  5][TX-C]   0.00-10.00  sec   966 MBytes   810 Mbits/sec   sender
      [  5][TX-C]   0.00-10.04  sec   965 MBytes   806 Mbits/sec   receiver
      [  7][RX-C]   0.00-10.00  sec  1.11 GBytes   956 Mbits/sec   sender
      [  7][RX-C]   0.00-10.04  sec  1.07 GBytes   919 Mbits/sec   receiver
      Signed-off-by: Vishvambar Panth S <vishvambarpanth.s@microchip.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20231116054350.620420-1-vishvambarpanth.s@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      45933b2d
  2. 19 Nov, 2023 13 commits
  3. 18 Nov, 2023 23 commits
    • r8169: improve RTL8411b phy-down fixup · 055dd751
      Heiner Kallweit authored
      Mirsad proposed a patch to reduce the number of spinlock lock/unlock
      operations and the function code size. This can be further improved
      because the function sets a consecutive register block.
      Suggested-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      055dd751
    • Merge tag 'mlx5-updates-2023-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · ce30df20
      David S. Miller authored
      mlx5-updates-2023-11-13
      
      1) Cleanup patches, leftovers from previous cycle
      
      2) Allow sync reset flow when BF MGT interface device is present
      
      3) Trivial ptp refactorings and improvements
      
      4) Add local loopback counter to vport rep stats
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce30df20
    • Merge tag 'batadv-next-pullrequest-20231115' of git://git.open-mesh.org/linux-merge · 39620a35
      David S. Miller authored
      This feature/cleanup patchset includes the following patches:
      
       - bump version strings, by Simon Wunderlich
      
       - Implement new multicast packet type, including its transmission,
         forwarding and parsing, by Linus Lüssing (3 patches)
      
       - Switch to new headers for sprintf and array size,
         by Sven Eckelmann (2 patches)
      Signed-off-by: David S. Miller <davem@davemloft.net>
      39620a35
    • Merge branch 'mlxsw-new-reset-flow' · 72a813a4
      David S. Miller authored
      Petr Machata says:
      
      ====================
      mlxsw: Add support for new reset flow
      
      Ido Schimmel writes:
      
      This patchset changes mlxsw to issue a PCI reset during probe and
      devlink reload so that the PCI firmware could be upgraded without a
      reboot.
      
      Unlike the old version of this patchset [1], in this version the driver
      no longer tries to issue a PCI reset by triggering a PCI link toggle on
      its own, but instead calls the PCI core to issue the reset.
      
      The PCI APIs require the device lock to be held which is why patches
      
      Patch #7 adds a reset method quirk for NVIDIA Spectrum devices.
      
      Patch #8 adds a debug level print in PCI core so that device ready delay
      will be printed even if it is shorter than one second.
      
      Patches #9-#11 are straightforward preparations in mlxsw.
      
      Patch #12 finally implements the new reset flow in mlxsw.
      
      Patch #13 adds PCI reset handlers in mlxsw to prevent user space from
      resetting the device from underneath an unaware driver. Instead, the
      driver is gracefully de-initialized before the PCI reset and then
      initialized again after it.
      
      Patch #14 adds a PCI reset selftest to make sure this code path does not
      regress.
      
      [1] https://lore.kernel.org/netdev/cover.1679502371.git.petrm@nvidia.com/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72a813a4
    • selftests: mlxsw: Add PCI reset test · af51d6bd
      Ido Schimmel authored
      Test that PCI reset works correctly by verifying that only the expected
      reset methods are supported and that after issuing the reset the ifindex
      of the port changes.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      af51d6bd
    • mlxsw: pci: Implement PCI reset handlers · 5e12d089
      Ido Schimmel authored
      Implement reset_prepare() and reset_done() handlers that are invoked by
      the PCI core before and after issuing a PCI reset, respectively.
      
      Specifically, implement reset_prepare() by calling
      mlxsw_core_bus_device_unregister() and reset_done() by calling
      mlxsw_core_bus_device_register(). This is the same implementation as the
      reload_{down,up}() devlink operations with the following differences:
      
      1. The devlink instance is unregistered and then registered again after
         the reset.
      
      2. A reset via the device's command interface (using MRSR register) is
         not issued during reset_done() as PCI core already issued a PCI
         reset.
      
      Tested:
      
       # for i in $(seq 1 10); do echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset; done
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e12d089
    • mlxsw: pci: Add support for new reset flow · f257c73e
      Ido Schimmel authored
      The driver resets the device during probe and during a devlink reload.
      The current reset method reloads the current firmware version or a
      pending one, if one was previously flashed using devlink. However, the
      current reset method does not result in a PCI hot reset, preventing the
      PCI firmware from being upgraded, unless the system is rebooted.
      
      To solve this problem, a new reset command (6) was implemented in the
      firmware. Unlike the current command (1), after issuing the new command
      the device will not start the reset immediately, but only after a PCI
      hot reset.
      
      Implement the new reset method by first verifying that it is supported
      by the current firmware version by querying the Management Capabilities
      Mask (MCAM) register. If supported, issue the new reset command (6) via
      MRSR register followed by a PCI reset by calling
      __pci_reset_function_locked().
      
      Once the PCI firmware is operational, go back to the regular reset flow
      and wait for the entire device to become ready. That is, repeatedly read
      the "system_status" register from the BAR until a value of "FW_READY"
      (0x5E) appears.
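
The sequence described above can be sketched as follows. This is a simplified model against a fake device, not the mlxsw code: `FakeDevice`, its method names, and the command constant names are hypothetical, and falling back to command 1 when the new method is unsupported is an assumption based on the commit message calling it the current command.

```python
FW_READY = 0x5E        # "FW_READY" value from the commit message
CMD_CURRENT_RESET = 1  # current reset command (constant name is hypothetical)
CMD_NEW_RESET = 6      # new reset command (constant name is hypothetical)


class FakeDevice:
    """Hypothetical stand-in for the hardware; not an mlxsw API."""

    def __init__(self, supports_new_reset, polls_until_ready=3):
        self.supports_new_reset = supports_new_reset
        self.polls_until_ready = polls_until_ready
        self.log = []
        self._polls = 0

    def query_mcam_new_reset(self):
        self.log.append("query MCAM")
        return self.supports_new_reset

    def write_mrsr(self, command):
        self.log.append(f"MRSR command {command}")

    def pci_reset(self):
        self.log.append("PCI reset")

    def read_system_status(self):
        self._polls += 1
        return FW_READY if self._polls >= self.polls_until_ready else 0x00


def reset_device(dev, max_polls=100):
    """Pick a reset method, then poll system_status until FW_READY."""
    if dev.query_mcam_new_reset():
        dev.write_mrsr(CMD_NEW_RESET)  # device waits for the PCI reset
        dev.pci_reset()
    else:
        dev.write_mrsr(CMD_CURRENT_RESET)  # assumed fallback path
    for _ in range(max_polls):
        if dev.read_system_status() == FW_READY:
            return True
    return False


new_fw = FakeDevice(supports_new_reset=True)
old_fw = FakeDevice(supports_new_reset=False)
print(reset_device(new_fw), new_fw.log)
print(reset_device(old_fw), old_fw.log)
```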
      
      Tested:
      
       # for i in $(seq 1 10); do devlink dev reload pci/0000:01:00.0; done
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f257c73e
    • mlxsw: pci: Move software reset code to a separate function · 8d9da467
      Amit Cohen authored
      In general, the existing flow of software reset in the driver is:
      1. Wait for system ready status.
      2. Send MRSR command, to start the reset.
      3. Wait for system ready status.
      
      This flow will be extended once a new reset command is supported. As a
      preparation, move step #2 to a separate function.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d9da467
    • mlxsw: pci: Rename mlxsw_pci_sw_reset() · bdf85f3a
      Amit Cohen authored
      In the next patches, mlxsw_pci_sw_reset() will be extended to support
      more reset types and will not necessarily issue a software reset. Rename
      the function to reflect that.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdf85f3a
    • mlxsw: Extend MRSR pack() function to support new commands · e6dbab40
      Amit Cohen authored
      Currently mlxsw_reg_mrsr_pack() always sets 'command=1'. As preparation for
      support of new reset flow, pass the command as an argument to the
      function and add an enum for this field.
      
      For now, always pass 'command=1' to the pack() function.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6dbab40
    • PCI: Add debug print for device ready delay · 0a5ef959
      Ido Schimmel authored
      Currently, the time it took a PCI device to become ready after reset is
      only printed if it was longer than 1000ms ('PCI_RESET_WAIT'). However,
      for debugging purposes it is useful to know this time even if it was
      shorter. For example, with the device I am working on, hardware
      engineers asked to verify that it becomes ready on the first try (no
      delay).
      
      To that end, add a debug level print that can be enabled using dynamic
      debug. Example:
      
       # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
       # dmesg -c | grep ready
       # echo "file drivers/pci/pci.c +p" > /sys/kernel/debug/dynamic_debug/control
       # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
       # dmesg -c | grep ready
       [  396.060335] mlxsw_spectrum4 0000:01:00.0: ready 0ms after bus reset
       # echo "file drivers/pci/pci.c -p" > /sys/kernel/debug/dynamic_debug/control
       # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
       # dmesg -c | grep ready
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a5ef959
    • PCI: Add no PM reset quirk for NVIDIA Spectrum devices · 3ed48c80
      Ido Schimmel authored
      Spectrum-{1,2,3,4} devices report that a D3hot->D0 transition causes a
      reset (i.e., they advertise NoSoftRst-). However, this transition does
      not have any effect on the device: It continues to be operational and
      network ports remain up. Advertising this support makes it seem as if a
      PM reset is viable for these devices. Mark it as unavailable to skip it
      when testing reset methods.
      
      Before:
      
       # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
       pm bus
      
      After:
      
       # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
       bus
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ed48c80
    • devlink: Add device lock assert in reload operation · 527a07e1
      Ido Schimmel authored
      Add an assert to verify that the device lock is always held throughout
      reload operations.
      
      Tested the following flows with netdevsim and mlxsw while lockdep is
      enabled:
      
      netdevsim:
      
       # echo "10 1" > /sys/bus/netdevsim/new_device
       # devlink dev reload netdevsim/netdevsim10
       # ip netns add bla
       # devlink dev reload netdevsim/netdevsim10 netns bla
       # ip netns del bla
       # echo 10 > /sys/bus/netdevsim/del_device
      
      mlxsw:
      
       # devlink dev reload pci/0000:01:00.0
       # ip netns add bla
       # devlink dev reload pci/0000:01:00.0 netns bla
       # ip netns del bla
       # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
       # echo 1 > /sys/bus/pci/rescan
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      527a07e1
    • devlink: Acquire device lock during reload command · bf6b200b
      Ido Schimmel authored
      Device drivers register with devlink from their probe routines (under
      the device lock) by acquiring the devlink instance lock and calling
      devl_register().
      
      Drivers that support a devlink reload usually implement the
      reload_{down, up}() operations in a similar fashion to their remove and
      probe routines, respectively.
      
      However, while the remove and probe routines are invoked with the device
      lock held, the reload operations are only invoked with the devlink
      instance lock held. It is therefore impossible for drivers to acquire
      the device lock from their reload operations, as this would result in
      lock inversion.
      
      The motivating use case for invoking the reload operations with the
      device lock held is in mlxsw which needs to trigger a PCI reset as part
      of the reload. The driver cannot call pci_reset_function() as this
      function acquires the device lock. Instead, it needs to call
      __pci_reset_function_locked(), which expects the device lock to be held.
      
      To that end, adjust devlink to always acquire the device lock before the
      devlink instance lock when performing a reload.
      
      Do that when reload is explicitly triggered by user space by specifying
      the 'DEVLINK_NL_FLAG_NEED_DEV_LOCK' flag in the pre_doit and post_doit
      operations of the reload command.
      
      A previous patch already handled the case where reload is invoked as
      part of netns dismantle.
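
The ordering rule (device lock first, devlink instance lock second) can be illustrated with a small rank-checking lock. This is a user-space analogy only, loosely in the spirit of what lockdep reports; the `OrderedLock` class and the rank values are hypothetical and have no counterpart in devlink.

```python
import threading


class OrderedLock:
    """Lock with a rank; a thread may only take locks in increasing rank."""

    _held = threading.local()  # per-thread stack of currently held locks

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank
        self._lock = threading.Lock()

    def __enter__(self):
        stack = getattr(OrderedLock._held, "stack", [])
        if stack and stack[-1].rank >= self.rank:
            raise RuntimeError(
                f"lock inversion: {self.name} taken after {stack[-1].name}"
            )
        self._lock.acquire()
        OrderedLock._held.stack = stack + [self]
        return self

    def __exit__(self, *exc_info):
        OrderedLock._held.stack = OrderedLock._held.stack[:-1]
        self._lock.release()
        return False


device_lock = OrderedLock("device lock", rank=1)
devlink_lock = OrderedLock("devlink instance lock", rank=2)


def reload_in_order():
    """Take the device lock before the devlink instance lock."""
    with device_lock:
        with devlink_lock:
            return "reload ran with both locks held"


def reload_inverted():
    """Try to take the device lock while holding the devlink lock."""
    with devlink_lock:
        with device_lock:
            return "unreachable"


print(reload_in_order())
```

Calling `reload_inverted()` raises `RuntimeError`, which mirrors why a driver could not take the device lock from a reload operation that already ran under the devlink instance lock.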
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf6b200b
    • devlink: Allow taking device lock in pre_doit operations · d32c3825
      Ido Schimmel authored
      Introduce a new private flag ('DEVLINK_NL_FLAG_NEED_DEV_LOCK') to allow
      netlink commands to specify that they need to acquire the device lock in
      their pre_doit operation and release it in their post_doit operation.
      
      The reload command will use this flag in the subsequent patch.
      
      No functional changes intended.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d32c3825
    • devlink: Enable the use of private flags in post_doit operations · c8d0a7d6
      Ido Schimmel authored
      Currently, private flags (e.g., 'DEVLINK_NL_FLAG_NEED_PORT') are only
      used in pre_doit operations, but a subsequent patch will need to
      conditionally lock and unlock the device lock in pre and post doit
      operations, respectively.
      
      As a preparation, enable the use of private flags in post_doit
      operations in a similar fashion to how it is done for pre_doit
      operations.
      
      No functional changes intended.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8d0a7d6
    • devlink: Acquire device lock during netns dismantle · e21c52d7
      Ido Schimmel authored
      Device drivers register with devlink from their probe routines (under
      the device lock) by acquiring the devlink instance lock and calling
      devl_register().
      
      Drivers that support a devlink reload usually implement the
      reload_{down, up}() operations in a similar fashion to their remove and
      probe routines, respectively.
      
      However, while the remove and probe routines are invoked with the device
      lock held, the reload operations are only invoked with the devlink
      instance lock held. It is therefore impossible for drivers to acquire
      the device lock from their reload operations, as this would result in
      lock inversion.
      
      The motivating use case for invoking the reload operations with the
      device lock held is in mlxsw which needs to trigger a PCI reset as part
      of the reload. The driver cannot call pci_reset_function() as this
      function acquires the device lock. Instead, it needs to call
      __pci_reset_function_locked(), which expects the device lock to be held.
      
      To that end, adjust devlink to always acquire the device lock before the
      devlink instance lock when performing a reload.
      
      For now, only do that when reload is triggered as part of netns
      dismantle. Subsequent patches will handle the case where reload is
      explicitly triggered by user space.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e21c52d7
    • devlink: Move private netlink flags to C file · 526dd6d7
      Ido Schimmel authored
      The flags are not used outside of the C file so move them there.
      Suggested-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      526dd6d7
    • Merge branch 'ncsi-mac-address-command' · 4dce97b1
      David S. Miller authored
      Patrick Williams says:
      
      ====================
      net/ncsi: Add NC-SI 1.2 Get MC MAC Address command
      
      NC-SI 1.2 has now been published[1] and adds a new command for "Get MC
      MAC Address".  This is often used by BMCs to get the assigned MAC
      address for the channel used by the BMC.
      
      This change set has been tested on a Broadcom 200G NIC with updated
      firmware for NC-SI 1.2 and at least one other non-public NIC design.
      
      1. https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4dce97b1
    • net/ncsi: Add NC-SI 1.2 Get MC MAC Address command · b8291cf3
      Peter Delevoryas authored
      This change adds support for the NC-SI 1.2 Get MC MAC Address command,
      specified here:
      
      https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf
      
      It serves the exact same function as the existing OEM Get MAC Address
      commands, so if a channel reports that it supports NC-SI 1.2, we prefer
      to use the standard command rather than the OEM command.
      
      Verified with an invalid MAC address and 2 valid ones:
      
      [   55.137072] ftgmac100 1e690000.ftgmac eth0: NCSI: Received 3 provisioned MAC addresses
      [   55.137614] ftgmac100 1e690000.ftgmac eth0: NCSI: MAC address 0: 00:00:00:00:00:00
      [   55.138026] ftgmac100 1e690000.ftgmac eth0: NCSI: MAC address 1: fa:ce:b0:0c:20:22
      [   55.138528] ftgmac100 1e690000.ftgmac eth0: NCSI: MAC address 2: fa:ce:b0:0c:20:23
      [   55.139241] ftgmac100 1e690000.ftgmac eth0: NCSI: Unable to assign 00:00:00:00:00:00 to device
      [   55.140098] ftgmac100 1e690000.ftgmac eth0: NCSI: Set MAC address to fa:ce:b0:0c:20:22
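
The selection behaviour visible in the log (skip the all-zero address, take the first usable one) can be sketched as below. The helper names are hypothetical, and the validity rule (not all-zero, not multicast) is an assumption modelled on common Ethernet address checks rather than taken from the patch.

```python
def parse_mac(text):
    """Turn 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


def is_valid_unicast(mac):
    """True for a 6-byte address that is neither all-zero nor multicast."""
    return len(mac) == 6 and any(mac) and not (mac[0] & 0x01)


def pick_mac(addresses):
    """Return the first address that passes the validity check, else None."""
    for text in addresses:
        if is_valid_unicast(parse_mac(text)):
            return text
    return None


provisioned = ["00:00:00:00:00:00", "fa:ce:b0:0c:20:22", "fa:ce:b0:0c:20:23"]
chosen = pick_mac(provisioned)
print(chosen)
```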
      Signed-off-by: Peter Delevoryas <peter@pjd.dev>
      Signed-off-by: Patrick Williams <patrick@stwcx.xyz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8291cf3
    • net/ncsi: Fix netlink major/minor version numbers · 3084b58b
      Peter Delevoryas authored
      The netlink interface for major and minor version numbers doesn't actually
      return the major and minor version numbers.
      
      It reports a u32 that contains the (major, minor, update, alpha1)
      components as the major version number, and then alpha2 as the minor
      version number.
      
      For whatever reason, the u32 byte order was reversed (ntohl): maybe it was
      assumed that the encoded value was a single big-endian u32, and alpha2 was
      the minor version.
      
      The correct way to get the supported NC-SI version from the network
      controller is to parse the Get Version ID response as described in 8.4.44
      of the NC-SI spec[1].
      
          Get Version ID Response Packet Format
      
                    Bits
                  +--------+--------+--------+--------+
           Bytes  | 31..24 | 23..16 | 15..8  | 7..0   |
          +-------+--------+--------+--------+--------+
          | 0..15 | NC-SI Header                      |
          +-------+--------+--------+--------+--------+
          | 16..19| Response code   | Reason code     |
          +-------+--------+--------+--------+--------+
          |20..23 | Major  | Minor  | Update | Alpha1 |
          +-------+--------+--------+--------+--------+
          |24..27 |         reserved         | Alpha2 |
          +-------+--------+--------+--------+--------+
          |            .... other stuff ....          |
      
      The major, minor, and update fields are all binary-coded decimal (BCD)
      encoded [2]. The spec provides examples below the Get Version ID response
      format in section 8.4.44.1, but for practical purposes, this is an example
      from a live network card:
      
          root@bmc:~# ncsi-util 0x15
          NC-SI Command Response:
          cmd: GET_VERSION_ID(0x15)
          Response: COMMAND_COMPLETED(0x0000)  Reason: NO_ERROR(0x0000)
          Payload length = 40
      
          20: 0xf1 0xf1 0xf0 0x00 <<<<<<<<< (major, minor, update, alpha1)
          24: 0x00 0x00 0x00 0x00 <<<<<<<<< (_, _, _, alpha2)
      
          28: 0x6d 0x6c 0x78 0x30
          32: 0x2e 0x31 0x00 0x00
          36: 0x00 0x00 0x00 0x00
          40: 0x16 0x1d 0x07 0xd2
          44: 0x10 0x1d 0x15 0xb3
          48: 0x00 0x17 0x15 0xb3
          52: 0x00 0x00 0x81 0x19
      
      This should be parsed as "1.1.0".
      
      "f" in the upper-nibble means to ignore it, contributing zero.
      
      If both nibbles are "f", I think the whole field is supposed to be ignored.
      Major and minor are "required", meaning they're not supposed to be "ff",
      but the update field is "optional" so I think it can be ff. I think the
      simplest thing to do is just set the major and minor to zero instead of
      juggling some conditional logic or something.
      
      bcd2bin() from "include/linux/bcd.h" seems to assume both nibbles are 0-9,
      so I've provided a custom BCD decoding function.
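
      A minimal sketch of such a decoder, assuming that any nibble outside 0-9
      (including "f") simply contributes zero, as described above. The function
      names are hypothetical and the code is a userspace illustration, not the
      function added by this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: decode one BCD byte. A nibble outside 0-9 (such as
 * 0xf) is treated as "ignore" and contributes zero, so 0xf1 -> 1 and
 * 0xff -> 0.
 */
uint8_t demo_decode_bcd_u8(uint8_t x)
{
    uint8_t lo = x & 0x0f;
    uint8_t hi = x >> 4;

    lo = lo < 10 ? lo : 0;
    hi = hi < 10 ? hi : 0;
    return (uint8_t)(hi * 10 + lo);
}

/*
 * Hypothetical helper: format "major.minor.update" from response bytes
 * 20..22. Returns whatever snprintf() returns.
 */
int demo_format_version(const uint8_t v[3], char *buf, size_t len)
{
    assert(v != NULL && buf != NULL);
    return snprintf(buf, len, "%u.%u.%u",
                    (unsigned)demo_decode_bcd_u8(v[0]),
                    (unsigned)demo_decode_bcd_u8(v[1]),
                    (unsigned)demo_decode_bcd_u8(v[2]));
}
```

      With the sample bytes 0xf1 0xf1 0xf0 from the dump above,
      demo_format_version() writes "1.1.0" into the buffer.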
      
      Alpha1 and alpha2 are ISO/IEC 8859-1 encoded, which just means ASCII
      characters as far as I can tell, although the full encoding table for
      non-alphabetic characters is slightly different (I think).
      
      I imagine the alpha fields are just supposed to be alphabetic characters,
      but I haven't seen any network cards actually report a non-zero value for
      either.
      
      If people wrote software against this netlink behavior, and were parsing
      the major and minor versions themselves from the u32, then this would
      definitely break their code.
      
      [1] https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.0.0.pdf
      [2] https://en.wikipedia.org/wiki/Binary-coded_decimal
      [3] https://en.wikipedia.org/wiki/ISO/IEC_8859-1
      Signed-off-by: Peter Delevoryas <peter@pjd.dev>
      Fixes: 138635cc ("net/ncsi: NCSI response packet handler")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3084b58b
    • Peter Delevoryas's avatar
      net/ncsi: Simplify Kconfig/dts control flow · c797ce16
      Peter Delevoryas authored
      Background:
      
      1. CONFIG_NCSI_OEM_CMD_KEEP_PHY
      
      If this is enabled, we send an extra OEM Intel command in the probe
      sequence immediately after discovering a channel (e.g. after "Clear
      Initial State").
      
      2. CONFIG_NCSI_OEM_CMD_GET_MAC
      
      If this is enabled, we send one of 3 OEM "Get MAC Address" commands from
      Broadcom, Mellanox (Nvidia), and Intel in the *configuration* sequence
      for a channel.
      
      3. mellanox,multi-host (or mlx,multi-host)
      
      Introduced by this patch:
      
      https://lore.kernel.org/all/20200108234341.2590674-1-vijaykhemka@fb.com/
      
      Which was actually originally from cosmo.chou@quantatw.com:
      
      https://github.com/facebook/openbmc-linux/commit/9f132a10ec48db84613519258cd8a317fb9c8f1b
      
      Cosmo claimed that the Nvidia ConnectX-4 and ConnectX-6 NICs don't
      respond to Get Version ID, et al., in the probe sequence unless you send
      the Set MC Affinity command first.
      
      Problem Statement:
      
      We've been using a combination of #ifdef code blocks and IS_ENABLED()
      conditions to conditionally send these OEM commands.
      
      It makes adding any new code around these commands hard to understand.
      
      Solution:
      
      In this patch, I just want to remove the conditionally compiled blocks
      of code, and always use IS_ENABLED(...) to do dynamic control flow.
      
      I don't think the small amount of code this adds to non-users of the OEM
      Kconfigs is a big deal.
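
      To illustrate the pattern, here is a userspace sketch that replaces an
      #ifdef block with an ordinary if statement. DEMO_IS_ENABLED() is a
      simplified stand-in for the kernel's IS_ENABLED() macro, and every
      DEMO_* name and the step ordering are hypothetical, chosen only for this
      example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's IS_ENABLED(): evaluates to 1 when
 * the option macro is defined as 1, and to 0 when it is not defined.
 */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val
#define DEMO_IS_DEFINED_3(arg1_or_junk) DEMO_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define DEMO_IS_DEFINED_2(val) DEMO_IS_DEFINED_3(DEMO_PLACEHOLDER_##val)
#define DEMO_IS_ENABLED(option) DEMO_IS_DEFINED_2(option)

/* Hypothetical config: GET_MAC on, KEEP_PHY deliberately left undefined. */
#define DEMO_CONFIG_OEM_CMD_GET_MAC 1

enum demo_step {
    DEMO_STEP_CLEAR_INITIAL_STATE,
    DEMO_STEP_OEM_KEEP_PHY,
    DEMO_STEP_GET_VERSION_ID,
};

/* Fill out[] with an illustrative probe sequence; returns the step count. */
size_t demo_probe_steps(enum demo_step *out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL && cap >= 3);
    out[n++] = DEMO_STEP_CLEAR_INITIAL_STATE;
    /* Always compiled and type-checked; the dead branch is optimized out. */
    if (DEMO_IS_ENABLED(DEMO_CONFIG_OEM_CMD_KEEP_PHY))
        out[n++] = DEMO_STEP_OEM_KEEP_PHY;
    out[n++] = DEMO_STEP_GET_VERSION_ID;
    return n;
}
```

      Because the condition is a compile-time constant, the compiler can still
      drop the unused branch, while the code inside it keeps being parsed and
      type-checked in every configuration.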
      Signed-off-by: Peter Delevoryas <peter@pjd.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c797ce16
    • David S. Miller's avatar
      Merge branch 'net-make-timestamping-selectable' · f9672265
      David S. Miller authored
      Kory Maincent says:
      
      ====================
      net: Make timestamping selectable
      
      Up until now, there was no way to let the user select the layer at
      which time stamping occurs. The stack assumed that PHY time stamping
      is always preferred, but some MAC/PHY combinations were buggy.
      
      This series updates the default MAC/PHY timestamping and aims to
      allow the user to select the desired layer administratively.
      
      Changes in v2:
      - Move selected_timestamping_layer variable of the concerned patch.
      - Use sysfs_streq instead of strmcmp.
      - Use the PHY timestamp only if available.
      
      Changes in v3:
      - Expose the PTP choice to ethtool instead of sysfs.
        You can test it with the ethtool source on branch feature_ptp of:
        https://github.com/kmaincent/ethtool
      - Added a devicetree binding to select the preferred timestamp.
      
      Changes in v4:
      - Move on to ethtool netlink instead of ioctl.
      - Add a netdev notifier to allow packet trapping by the MAC in case of PHY
        time stamping.
      - Add a PHY whitelist to not break the old PHY default time-stamping
        preference API.
      
      Changes in v5:
      - Update to ndo_hwstamp_get/set. This brings several new patches.
      - Add a few patches to provide the glue.
      - Convert macb to ndo_hwstamp_get/set.
      - Add netlink specs description of new ethtool commands.
      - Removed netdev notifier.
      - Split the patches that expose the timestamping to userspace to separate
        the core and ethtool development.
      - Add description of software timestamping.
      - Convert PHYs hwtstamp callback to use kernel_hwtstamp_config.
      
      Changes in v6:
      - A few fixes from the reviews.
      - Replace the allowlist with a default_timestamp flag to identify which
        PHYs use the old API behavior.
      - Rename the timestamping layer enum values.
      - Move to a simple enum instead of the mix between enum and bitfield.
      - Update ts_info and ts-set in software timestamping case.
      
      Changes in v7:
      - Fix a temporary build error.
      - Link to v6: https://lore.kernel.org/r/20231019-feature_ptp_netnext-v6-0-71affc27b0e5@bootlin.com
      ====================
      Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f9672265