1. 14 Aug, 2023 40 commits
    • net/mlx5e: Add recovery flow for tx devlink health reporter for unhealthy PTP SQ · 53b836a4
      Rahul Rameshbabu authored
      A new check for the tx devlink health reporter is introduced for
      determining when the PTP port timestamping SQ is considered unhealthy. If
      there are enough CQEs considered never to be delivered, the space that can
      be utilized on the SQ decreases significantly, impacting performance and
      usability of the SQ. The health reporter is triggered when the number of
      likely never delivered port timestamping CQEs that utilize the space of the
      PTP SQ is greater than 93.75% of the total capacity of the SQ. A devlink
      health reporter recover method is also provided for this specific TX error
      context that restarts the PTP SQ.
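A minimal sketch of the threshold check described above, assuming the SQ capacity is a power of two so that 93.75% (15/16) can be computed with a shift. The function name `demo_ptp_sq_unhealthy` and its parameters are hypothetical and are not taken from the mlx5e driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: report the PTP SQ as unhealthy once the number of
 * port timestamping CQEs considered undeliverable exceeds 93.75% of the
 * SQ capacity. 93.75% == 1 - 1/16, so the limit is capacity - capacity/16.
 */
static bool demo_ptp_sq_unhealthy(uint32_t undelivered_cqes, uint32_t sq_capacity)
{
	uint32_t limit = sq_capacity - (sq_capacity >> 4);

	return undelivered_cqes > limit;
}
```

With a capacity of 1024 entries the limit is 960, so the reporter in this sketch would only trigger from the 961st undelivered CQE onward.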
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Make tx_port_ts logic resilient to out-of-order CQEs · 3178308a
      Rahul Rameshbabu authored
      Use a map structure for associating CQEs containing port timestamping
      information with the appropriate skb. Track order of WQEs submitted using a
      FIFO. Check if the corresponding port timestamping CQEs from the lookup
      values in the FIFO are considered dropped due to time elapsed. Return the
      lookup value to a freelist after consuming the skb. Reuse the freed lookup
      in future WQE submission iterations.
      
      The map structure uses an integer identifier for the key and returns an skb
      corresponding to that identifier. Embed the integer identifier in the WQE
      submitted to the WQ for the transmit path when the SQ is a PTP (port
      timestamping) SQ. The embedded identifier can then be queried using a field
      in the CQE of the corresponding port timestamping CQ. In the port
      timestamping napi_poll context, the identifier is queried from the CQE
      polled from the CQ and used to look up the corresponding skb from the WQE submit
      path. The skb reference is removed from the map and then embedded with the port
      HW timestamp information from the CQE and eventually consumed.
      
      The metadata freelist FIFO is an array containing integer identifiers that
      can be pushed and popped in the FIFO. The purpose of this structure is
      bookkeeping what identifier values can safely be used in a subsequent WQE
      submission and should not contain identifiers that have still not been
      reaped by processing a corresponding CQE completion on the port
      timestamping CQ.
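The identifier map and the metadata freelist FIFO described above could be sketched as follows. This is an illustration under stated assumptions, not the driver's code: `demo_skb`, `demo_ptp_meta`, `DEMO_SQ_SIZE` and the function names are all hypothetical, and the FIFO is assumed to have a power-of-two size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SQ_SIZE 8 /* hypothetical size; must be a power of two */

struct demo_skb { int id; }; /* stand-in for struct sk_buff */

struct demo_ptp_meta {
	uint16_t freelist[DEMO_SQ_SIZE];     /* FIFO of identifiers free for reuse */
	uint16_t pc, cc;                     /* FIFO producer / consumer counters */
	struct demo_skb *map[DEMO_SQ_SIZE];  /* identifier -> skb awaiting its CQE */
};

static void demo_meta_init(struct demo_ptp_meta *m)
{
	m->pc = m->cc = 0;
	for (uint16_t i = 0; i < DEMO_SQ_SIZE; i++) {
		m->map[i] = NULL;
		m->freelist[m->pc++ & (DEMO_SQ_SIZE - 1)] = i;
	}
}

/* WQE submit path: pop a free identifier and remember the skb under it.
 * Returns the identifier to embed in the WQE, or -1 if none is free. */
static int demo_meta_submit(struct demo_ptp_meta *m, struct demo_skb *skb)
{
	uint16_t id;

	if (m->pc == m->cc)
		return -1;
	id = m->freelist[m->cc++ & (DEMO_SQ_SIZE - 1)];
	m->map[id] = skb;
	return id;
}

/* CQE poll path: look up the skb by the identifier read from the CQE,
 * drop it from the map, and push the identifier back for reuse. */
static struct demo_skb *demo_meta_complete(struct demo_ptp_meta *m, uint16_t id)
{
	struct demo_skb *skb;

	if (id >= DEMO_SQ_SIZE || !m->map[id])
		return NULL;
	skb = m->map[id];
	m->map[id] = NULL;
	m->freelist[m->pc++ & (DEMO_SQ_SIZE - 1)] = id;
	return skb;
}
```

Because the lookup is keyed by identifier rather than by submission order, completions may arrive in any order without associating a timestamp with the wrong skb.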
      
      The ts_cqe_pending_list structure is a combination of an array and linked
      list. The array is pre-populated with the nodes that will be added and
      removed from the head of the linked list. Each node contains the unique
      identifier value associated with the values submitted in the WQEs and
      retrieved in the port timestamping CQEs. When a WQE is submitted, the node
      in the array corresponding to the identifier popped from the metadata
      freelist is added to the end of the CQE pending list and is marked as
      "in-use". The node is removed from the linked list under two conditions.
      The first condition is that the corresponding port timestamping CQE is
      polled in the PTP napi_poll context. The second condition is that more than
      a second has elapsed since the DMA timestamp value corresponding to the WQE
      submission. When the first condition occurs, the "in-use" bit in the linked
      list node is cleared, and the resources corresponding to the WQE submission
      are then released. The second condition, however, indicates that the port
      timestamping CQE will likely never be delivered. It is highly improbable,
      though not impossible, for the device to post the CQE after an arbitrarily
      long delay. In order to be resilient to this improbable case, resources
      related to the corresponding WQE submission are still kept, the identifier
      value is not returned to the freelist, and the "in-use" bit is cleared on
      the node to indicate that it's no longer part of the linked list of "likely
      to be delivered" port timestamping CQE identifiers. A count for the number
      of port timestamping CQEs considered highly likely to never be delivered by
      the device is maintained. This count gets decremented in the unlikely event
      a port timestamping CQE considered unlikely to ever be delivered is polled
      in the PTP napi_poll context.
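The pending-list bookkeeping described above can be sketched as an array of pre-allocated nodes linked by index. All names here are hypothetical. The separate `lost` flag is a device of this sketch for recognizing a late CQE, and returning `true` for a late CQE (allowing the identifier to be reused) is an assumption that the text above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NODES      4
#define DEMO_TIMEOUT_NS 1000000000ULL /* "more than a second" */
#define DEMO_NONE       (-1)

struct demo_node {
	int prev, next;     /* indices into the node array */
	uint64_t submit_ns; /* timestamp taken at WQE submission */
	bool in_use;        /* currently on the pending list */
	bool lost;          /* aged out; hypothetical flag for this sketch */
};

struct demo_pending {
	struct demo_node nodes[DEMO_NODES];
	int head, tail;
	unsigned int lost_count; /* CQEs considered unlikely to be delivered */
};

static void demo_pending_init(struct demo_pending *p)
{
	p->head = p->tail = DEMO_NONE;
	p->lost_count = 0;
	for (int i = 0; i < DEMO_NODES; i++)
		p->nodes[i] = (struct demo_node){ DEMO_NONE, DEMO_NONE, 0, false, false };
}

static void demo_unlink(struct demo_pending *p, int id)
{
	struct demo_node *n = &p->nodes[id];

	if (n->prev != DEMO_NONE)
		p->nodes[n->prev].next = n->next;
	else
		p->head = n->next;
	if (n->next != DEMO_NONE)
		p->nodes[n->next].prev = n->prev;
	else
		p->tail = n->prev;
	n->prev = n->next = DEMO_NONE;
	n->in_use = false;
}

/* WQE submit: append the node for this identifier and mark it in-use. */
static void demo_pending_add(struct demo_pending *p, int id, uint64_t now_ns)
{
	struct demo_node *n = &p->nodes[id];

	n->prev = p->tail;
	n->next = DEMO_NONE;
	n->submit_ns = now_ns;
	n->in_use = true;
	n->lost = false;
	if (p->tail != DEMO_NONE)
		p->nodes[p->tail].next = id;
	else
		p->head = id;
	p->tail = id;
}

/* Age out entries older than the timeout, oldest first. The identifier is
 * deliberately not returned to the freelist; only the count is bumped. */
static void demo_pending_age(struct demo_pending *p, uint64_t now_ns)
{
	while (p->head != DEMO_NONE &&
	       now_ns - p->nodes[p->head].submit_ns > DEMO_TIMEOUT_NS) {
		int id = p->head;

		demo_unlink(p, id);
		p->nodes[id].lost = true;
		p->lost_count++;
	}
}

/* CQE polled for this identifier. Returns true when the identifier may be
 * reused (assumed here to also hold for a late CQE). */
static bool demo_pending_cqe(struct demo_pending *p, int id)
{
	struct demo_node *n = &p->nodes[id];

	if (n->in_use) {
		demo_unlink(p, id);
		return true;
	}
	if (n->lost) {
		n->lost = false;
		p->lost_count--;
		return true;
	}
	return false;
}
```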
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Consolidate devlink documentation in devlink/mlx5.rst · b608dd67
      Rahul Rameshbabu authored
      De-duplicate documentation by removing mellanox/mlx5/devlink.rst. Instead,
      only use the generic devlink documentation directory to document mlx5
      devlink parameters. Avoid providing general devlink tool usage information
      in mlx5-specific documentation.
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Merge branch 'devlink-introduce-selective-dumps' · f3cc0030
      Jakub Kicinski authored
      Jiri Pirko says:
      
      ====================
      devlink: introduce selective dumps
      
      Motivation:
      
      For SFs, one devlink instance per SF is created. There might be
      thousands of these on a single host. When a user needs to know the port
      handle for a specific SF, they need to dump all devlink ports on the host,
      which does not scale well.
      
      Solution:
      
      Allow user to pass devlink handle (and possibly other attributes)
      alongside the dump command and dump only objects which are matching
      the selection.
      
      Use split ops to generate policies for dump callbacks according to
      the attributes used for selection.
      
      The userspace can use ctrl genetlink GET_POLICY command to find out if
      the selective dumps are supported by kernel for particular command.
      
      Example:
      $ devlink port show
      auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
      auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
      
      $ devlink port show auxiliary/mlx5_core.eth.0
      auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
      
      $ devlink port show auxiliary/mlx5_core.eth.1
      auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
      
      Extension:
      
      Patches #12 and #13 extend the selection attributes by port index
      for health reporter dumping.
      ====================
      
      Link: https://lore.kernel.org/r/20230811155714.1736405-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netlink: specs: devlink: extend health reporter dump attributes by port index · 0149bca1
      Jiri Pirko authored
      Allow user to pass port index for health reporter dump request.
      
      Re-generate the related code.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: extend health reporter dump selector by port index · b03f13cb
      Jiri Pirko authored
      Introduce a possibility for devlink object to expose attributes it
      supports for selection of dumped objects.
      
      Use this by health reporter to indicate it supports port index based
      selection of dump objects. Implement this selection mechanism in
      devlink_nl_cmd_health_reporter_get_dump_one().
      
      Example:
      $ devlink health
      pci/0000:08:00.0:
        reporter fw
          state healthy error 0 recover 0 auto_dump true
        reporter fw_fatal
          state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.0/32768:
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.0/32769:
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.0/32770:
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.1:
        reporter fw
          state healthy error 0 recover 0 auto_dump true
        reporter fw_fatal
          state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.1/98304:
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.1/98305:
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.1/98306:
        reporter vnic
          state healthy error 0 recover 0
      
      $ devlink health show pci/0000:08:00.0
      pci/0000:08:00.0:
        reporter fw
          state healthy error 0 recover 0 auto_dump true
        reporter fw_fatal
          state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.0/32768:
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.0/32769:
        reporter vnic
          state healthy error 0 recover 0
      pci/0000:08:00.0/32770:
        reporter vnic
          state healthy error 0 recover 0
      
      $ devlink health show pci/0000:08:00.0/32768
      pci/0000:08:00.0/32768:
        reporter vnic
          state healthy error 0 recover 0
      
      The last command is possible because of this patch.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-13-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netlink: specs: devlink: extend per-instance dump commands to accept instance attributes · 34493336
      Jiri Pirko authored
      Extend per-instance dump command definitions to accept instance
      attributes. Allow parsing of devlink handle attributes so they could
      be used for instance selection.
      
      Re-generate the related code.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: allow user to narrow per-instance dumps by passing handle attrs · 4a1b5aa8
      Jiri Pirko authored
      For SFs, one devlink instance per SF is created. There might be
      thousands of these on a single host. When a user needs to know the port
      handle for a specific SF, they need to dump all devlink ports on the host,
      which does not scale well.
      
      Allow user to pass devlink handle attributes alongside the dump command
      and dump only objects which are under selected devlink instance.
      
      Example:
      $ devlink port show
      auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
      auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
      
      $ devlink port show auxiliary/mlx5_core.eth.0
      auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
      
      $ devlink port show auxiliary/mlx5_core.eth.1
      auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
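The narrowing behaviour shown in the example can be sketched as a dump loop with an optional handle selector. This is a simplified illustration only: `demo_port` and `demo_dump_ports` are hypothetical names, and the real implementation works on netlink attributes rather than plain strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_port {
	const char *bus;    /* e.g. "auxiliary" */
	const char *dev;    /* e.g. "mlx5_core.eth.0" */
	unsigned int index; /* port index */
};

/*
 * Collect pointers to the ports that match the selector into out[].
 * A NULL bus or dev means "no selector": every port is dumped.
 * Returns the number of ports written to out[].
 */
static size_t demo_dump_ports(const struct demo_port *ports, size_t n,
			      const char *bus, const char *dev,
			      const struct demo_port **out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (bus && dev &&
		    (strcmp(ports[i].bus, bus) != 0 ||
		     strcmp(ports[i].dev, dev) != 0))
			continue; /* belongs to another devlink instance */
		out[count++] = &ports[i];
	}
	return count;
}
```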
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: remove converted commands from small ops · 833e479d
      Jiri Pirko authored
      As the commands are already defined in split ops, remove them
      from small ops.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: remove duplicate temporary netlink callback prototypes · ddff2832
      Jiri Pirko authored
      Remove the duplicate temporary netlink callback prototype as the
      generated ones are already in place.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netlink: specs: devlink: add commands that do per-instance dump · 7199c862
      Jiri Pirko authored
      Add the definitions for the commands that do per-instance dump
      and re-generate the related code.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: pass flags as an arg of dump_one() callback · 7d3c6fec
      Jiri Pirko authored
      In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the
      flags as an arg of dump_one() callback. Currently, it is always
      NLM_F_MULTI.
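A rough sketch of why passing the flags into the callback is convenient: the iterator decides once whether the dump is filtered and hands the result to every dump_one() call. The `DEMO_F_*` constants are local stand-ins for NLM_F_MULTI and NLM_F_DUMP_FILTERED, and the function names, the integer instance numbering, and the `selected` parameter are assumptions made for this illustration.

```c
#include <assert.h>

/* Local stand-ins for the netlink message flags named in the text. */
enum {
	DEMO_F_MULTI         = 0x02,
	DEMO_F_DUMP_FILTERED = 0x20,
};

/* Hypothetical dump_one() shape: flags arrive as an argument. */
typedef int (*demo_dump_one_t)(int instance, int flags, void *ctx);

/*
 * Iterate over all instances, or only over `selected` when it is >= 0.
 * Every message is part of a multipart dump; a narrowed dump is also
 * marked as filtered.
 */
static int demo_dump_all(int n_instances, int selected,
			 demo_dump_one_t dump_one, void *ctx)
{
	int flags = DEMO_F_MULTI;

	if (selected >= 0)
		flags |= DEMO_F_DUMP_FILTERED;

	for (int i = 0; i < n_instances; i++) {
		int err;

		if (selected >= 0 && i != selected)
			continue;
		err = dump_one(i, flags, ctx);
		if (err)
			return err;
	}
	return 0;
}

/* Example callback that records how it was called. */
struct demo_record {
	int calls;
	int last_flags;
};

static int demo_record_one(int instance, int flags, void *ctx)
{
	struct demo_record *r = ctx;

	(void)instance;
	r->calls++;
	r->last_flags = flags;
	return 0;
}
```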
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: introduce dumpit callbacks for split ops · 24c8e56d
      Jiri Pirko authored
      Introduce dumpit callbacks for generated split ops. Make them thin
      wrappers around the iteration function and allow passing the dump_one()
      function pointer directly, without the need to store it in devlink_cmd
      structs.

      Note that the function prototypes are temporary until the generated ones
      replace them in a follow-up patch.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: rename doit callbacks for per-instance dump commands · 8fa995ad
      Jiri Pirko authored
      Rename netlink doit callback functions for the commands that
      implement per-instance dump to match the generated names that are going
      to be introduced in the follow-up patch.

      Note that the function prototypes are temporary until the generated ones
      replace them in a follow-up patch.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: introduce devlink_nl_pre_doit_port*() helper functions · ee6d78ac
      Jiri Pirko authored
      Define port handling helpers that don't rely on internal_flags.
      Have __devlink_nl_pre_doit() accept the flags as a function arg and
      make devlink_nl_pre_doit() a wrapper helper function calling it.
      Introduce new helpers devlink_nl_pre_doit_port() and
      devlink_nl_pre_doit_port_optional() to be used by split ops in a follow-up
      patch.

      Note that the function prototypes are temporary until the generated ones
      replace them in a follow-up patch.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: parse rate attrs in doit() callbacks · 41a1d4d1
      Jiri Pirko authored
      No need to give the rate any special treatment in netlink attributes
      parsing, as unlike for ports, there are only a couple of commands
      benefiting from that.
      
      Remove DEVLINK_NL_FLAG_NEED_RATE*, make pre_doit() callback simpler
      by moving the rate attributes parsing to rate_*_doit() ops.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: parse linecard attr in doit() callbacks · 63618463
      Jiri Pirko authored
      No need to give the linecards any special treatment in netlink attribute
      parsing, as unlike for ports, there are only a couple of commands
      benefiting from that.
      
      Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler
      by moving the linecard attribute parsing to linecard_[gs]et_doit() ops.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: phy: Introduce PSGMII PHY interface mode · 83b5f025
      Gabor Juhos authored
      The PSGMII interface is similar to QSGMII. The main difference
      is that the PSGMII interface combines five SGMII lines into a
      single link while in QSGMII only four lines are combined.
      
      Similarly to QSGMII, this interface mode might also need
      special handling within the MAC driver.

      It is commonly used by Qualcomm with their QCA807x PHY series and
      modern WiSoC-s.

      Add definitions for the PHY layer to allow expressing this type
      of connection between the MAC and PHY.
      Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
      Signed-off-by: Robert Marko <robert.marko@sartura.hr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: ethernet-controller: add PSGMII mode · de875d35
      Robert Marko authored
      Add a new PSGMII mode which is similar to QSGMII with the difference being
      that it combines 5 SGMII lines into a single link compared to 4 on QSGMII.
      
      It is commonly used by Qualcomm on their QCA807x PHY series.
      Signed-off-by: Robert Marko <robert.marko@sartura.hr>
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-redirection' · 2d93c30c
      David S. Miller authored
      Petr Machata says:
      
      ====================
      mlxsw: Support traffic redirection from a locked bridge port
      
      Ido Schimmel writes:
      
      It is possible to add a filter that redirects traffic from the ingress
      of a bridge port that is locked (i.e., performs security / SMAC lookup)
      and has learning enabled. For example:
      
       # ip link add name br0 type bridge
       # ip link set dev swp1 master br0
       # bridge link set dev swp1 learning on locked on mab on
       # tc qdisc add dev swp1 clsact
       # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2
      
      In the kernel's Rx path, this filter is evaluated before the Rx handler
      of the bridge, which means that redirected traffic should not be
      affected by bridge port configuration such as learning.
      
      However, the hardware data path is a bit different and the redirect
      action (FORWARDING_ACTION in hardware) merely attaches a pointer to the
      packet, which is later used by the L2 lookup stage to understand how to
      forward the packet. Between both stages - ingress ACL and L2 lookup -
      learning and security lookup are performed, which means that redirected
      traffic is affected by bridge port configuration, unlike in the kernel's
      data path.
      
      The learning discrepancy was handled in commit 577fa14d ("mlxsw:
      spectrum: Do not process learned records with a dummy FID") by simply
      ignoring learning notifications generated by the redirected traffic. A
      similar solution is not possible for the security / SMAC lookup since
      - unlike learning - the CPU is not involved and packets that failed the
      lookup are dropped by the device.
      
      Instead, solve this by prepending the ignore action to the redirect
      action and use it to instruct the device to disable both learning and
      the security / SMAC lookup for redirected traffic.
      
      Patch #1 adds the ignore action.
      
      Patch #2 prepends the action to the redirect action in flower offload
      code.
      
      Patch #3 removes the workaround in commit 577fa14d ("mlxsw:
      spectrum: Do not process learned records with a dummy FID") since it is
      no longer needed.
      
      Patch #4 adds a test case.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: Add test case for traffic redirection from a locked port · 38c43a1c
      Ido Schimmel authored
      Check that traffic can be redirected from a locked bridge port and that
      it does not create locked FDB entries.
      
      Cc: Hans J. Schultz <netdev@kapio-technology.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Stop ignoring learning notifications from redirected traffic · 9793a5a9
      Ido Schimmel authored
      As explained in the previous patch, with the ignore action prepended to
      the redirect action, it is no longer possible for redirected traffic to
      generate learning notifications.
      
      Therefore, remove the workaround that was added in commit 577fa14d
      ("mlxsw: spectrum: Do not process learned records with a dummy FID") as
      it is no longer needed.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_flower: Disable learning and security lookup when redirecting · 0433670e
      Ido Schimmel authored
      It is possible to add a filter that redirects traffic from the ingress
      of a bridge port that is locked (i.e., performs security / SMAC lookup)
      and has learning enabled. For example:
      
       # ip link add name br0 type bridge
       # ip link set dev swp1 master br0
       # bridge link set dev swp1 learning on locked on mab on
       # tc qdisc add dev swp1 clsact
       # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2
      
      In the kernel's Rx path, this filter is evaluated before the Rx handler
      of the bridge, which means that redirected traffic should not be
      affected by bridge port configuration such as learning.
      
      However, the hardware data path is a bit different and the redirect
      action (FORWARDING_ACTION in hardware) merely attaches a pointer to the
      packet, which is later used by the L2 lookup stage to understand how to
      forward the packet. Between both stages - ingress ACL and L2 lookup -
      learning and security lookup are performed, which means that redirected
      traffic is affected by bridge port configuration, unlike in the kernel's
      data path.
      
      The learning discrepancy was handled in commit 577fa14d ("mlxsw:
      spectrum: Do not process learned records with a dummy FID") by simply
      ignoring learning notifications generated by the redirected traffic. A
      similar solution is not possible for the security / SMAC lookup since
      - unlike learning - the CPU is not involved and packets that failed the
      lookup are dropped by the device.
      
      Instead, solve this by prepending the ignore action to the redirect
      action and use it to instruct the device to disable both learning and
      the security / SMAC lookup for redirected traffic.
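The ordering described above, where the ignore action is placed immediately before the redirect (forwarding) action, could be sketched like this. The `demo_*` types and functions are hypothetical and only model an ordered action list; they are not the mlxsw flex-action API.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ACTIONS 8

enum demo_action {
	DEMO_ACT_IGNORE,  /* disable learning and security / SMAC lookup */
	DEMO_ACT_FORWARD, /* redirect: attach the forwarding pointer */
	DEMO_ACT_COUNTER, /* some unrelated action, for illustration */
};

struct demo_rule {
	enum demo_action acts[DEMO_MAX_ACTIONS];
	size_t n;
};

static int demo_rule_append(struct demo_rule *r, enum demo_action a)
{
	if (r->n >= DEMO_MAX_ACTIONS)
		return -1;
	r->acts[r->n++] = a;
	return 0;
}

/*
 * Add a redirect to the rule. The ignore action is always emitted first so
 * that the device skips learning and the security lookup for this packet
 * before the forwarding decision is applied.
 */
static int demo_rule_add_redirect(struct demo_rule *r)
{
	if (r->n + 2 > DEMO_MAX_ACTIONS)
		return -1; /* refuse to add a redirect without its ignore */
	demo_rule_append(r, DEMO_ACT_IGNORE);
	return demo_rule_append(r, DEMO_ACT_FORWARD);
}
```

Checking for room for both actions up front avoids a half-built rule in which the redirect would exist without the ignore action in front of it.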
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_acl_flex_actions: Add IGNORE_ACTION · d0d449c7
      Ido Schimmel authored
      Add the IGNORE_ACTION which is used to ignore basic switching functions
      such as learning on a per-packet basis.
      
      The action will be prepended to the FORWARDING_ACTION in subsequent
      patches.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: xgmac: show more MAC HW features in debugfs · 58c1e0ba
      Furong Xu authored
      1. Show TSSTSSEL(Timestamp System Time Source),
      ADDMACADRSEL(additional MAC addresses), SMASEL(SMA/MDIO Interface),
      HDSEL(Half-duplex Support) in debugfs.
      2. Show exact number of additional MAC address registers for XGMAC2 core.
      3. XGMAC2 core does not have different IP checksum offload types, so just
      show rx_coe instead of rx_coe_type1 or rx_coe_type2.
      4. XGMAC2 core does not have rxfifo_over_2048 definition, skip it.
      Signed-off-by: Furong Xu <0x1207@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-stats-helpers' · a9142847
      David S. Miller authored
      Li Zetao says:
      
      ====================
      Use helper functions to update stats
      
      The patch set uses the helper functions dev_sw_netstats_rx_add() and
      dev_sw_netstats_tx_add() to update stats, which is equivalent to
      open-coding the same updates in each driver.
      ====================
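To illustrate what such helpers replace, here is a simplified, single-threaded sketch of the bookkeeping. The `demo_*` names and the plain struct are assumptions for illustration; the kernel helpers operate on a net_device's per-CPU software stats with proper synchronization, which this sketch omits.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for per-device software counters. */
struct demo_sw_netstats {
	uint64_t rx_packets, rx_bytes;
	uint64_t tx_packets, tx_bytes;
};

/* Account one received packet of `len` bytes. */
static void demo_netstats_rx_add(struct demo_sw_netstats *s, unsigned int len)
{
	s->rx_bytes += len;
	s->rx_packets++;
}

/* Account `packets` transmitted packets totalling `len` bytes. */
static void demo_netstats_tx_add(struct demo_sw_netstats *s,
				 unsigned int packets, unsigned int len)
{
	s->tx_bytes += len;
	s->tx_packets += packets;
}
```

A driver that previously incremented the byte and packet counters by hand at each call site would instead make a single helper call per packet or batch.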
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: Use helper functions to update stats · 3c0930b4
      Li Zetao authored
      Use the helper functions dev_sw_netstats_rx_add() and
      dev_sw_netstats_tx_add() to update stats, which improves
      code readability.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: macsec: Use helper functions to update stats · bf98bbe9
      Li Zetao authored
      Use the helper functions dev_sw_netstats_rx_add() and
      dev_sw_netstats_tx_add() to update stats, which improves
      code readability.
      Signed-off-by: Li Zetao <lizetao1@huawei.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vmxnet3: Add XDP support. · 54f00cce
      William Tu authored
      The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.
      
      Background:
      The vmxnet3 rx consists of three rings: ring0, ring1, and dataring.
      For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and dma
      mapped to the ring's descriptor. If LRO is enabled and the packet size is
      larger than 3K (VMXNET3_MAX_SKB_BUF_SIZE), then r1 is used to map the rest
      of the buffer beyond VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is
      allocated using alloc_page. So for LRO packets, the payload will be in one
      buffer from r0 and multiple from r1; for non-LRO packets, only one
      descriptor in r0 is used for packet sizes less than 3K.
      
      When receiving a packet, the first descriptor will have the sop (start of
      packet) bit set, and the last descriptor will have the eop (end of packet)
      bit set. Non-LRO packets will have only one descriptor with both sop and
      eop set.
      
      Other than r0 and r1, vmxnet3 dataring is specifically designed for
      handling packets with small size, usually 128 bytes, defined in
      VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend
      driver in ESXi to the ring's memory region at front-end vmxnet3 driver, in
      order to avoid memory mapping/unmapping overhead. In summary, packet size:
          A. < 128B: use dataring
          B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
          C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
      As a result, the patch adds XDP support for packets using dataring
      and r0 (case A and B), not the large packet size when LRO is enabled.
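The size-based split in cases A to C can be sketched as a small classifier. The `DEMO_*` constants mirror the 128-byte and 3K figures quoted above, and all names are hypothetical. How the driver treats a packet of exactly 128 bytes or exactly 3K is not stated in the text, so the boundary handling here is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RXDATA_DESC_SIZE 128        /* text: VMXNET3_DEF_RXDATA_DESC_SIZE */
#define DEMO_MAX_SKB_BUF_SIZE (3 * 1024) /* text: "3K", VMXNET3_MAX_SKB_BUF_SIZE */

enum demo_rx_path {
	DEMO_RX_DATARING,    /* case A: small packet copied via dataring */
	DEMO_RX_RING0,       /* case B: single buffer from ring0 */
	DEMO_RX_RING0_RING1, /* case C: ring0 plus page buffers from ring1 */
};

/* Pick the receive path for a packet of `len` bytes (boundaries assumed). */
static enum demo_rx_path demo_classify_rx(size_t len)
{
	if (len < DEMO_RXDATA_DESC_SIZE)
		return DEMO_RX_DATARING;
	if (len <= DEMO_MAX_SKB_BUF_SIZE)
		return DEMO_RX_RING0;
	return DEMO_RX_RING0_RING1;
}

/* Per the text, XDP covers cases A and B but not large LRO packets. */
static bool demo_xdp_supported(size_t len)
{
	return demo_classify_rx(len) != DEMO_RX_RING0_RING1;
}
```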
      
      XDP Implementation:
      When the user loads an XDP prog, the vmxnet3 driver checks configurations,
      such as mtu and lro, and re-allocates the rx buffers to reserve the extra
      headroom, XDP_PACKET_HEADROOM, for the XDP frame. The XDP prog will then be
      associated with every rx queue of the device. Note that when using dataring
      for small packet size, vmxnet3 (front-end driver) doesn't control the
      buffer allocation, as a result we allocate a new page and copy packet
      from the dataring to XDP frame.
      
      The receive side of XDP is implemented for cases A and B by invoking the
      bpf program at vmxnet3_rq_rx_complete and handling its returned action.
      The vmxnet3_process_xdp() and vmxnet3_process_xdp_small() functions handle
      the ring0 and dataring cases respectively, and decide what happens to
      the packet afterward.
      
      For TX, vmxnet3 has a split header design. Outgoing packets are parsed
      first and protocol headers (L2/L3/L4) are copied to the backend. The
      rest of the payload is DMA mapped. Since XDP_TX does not parse the
      packet protocol, the entire XDP frame is DMA mapped for transmission
      and transmitted in a batch. Later on, the frame is freed and recycled
      back to the memory pool.
      
      Performance:
      Tested using two VMs inside one ESXi vSphere 7.0 machine, with a single
      core on each vmxnet3 device. The sender uses DPDK testpmd in txonly mode
      attached to the vmxnet3 device, sending 64B or 512B UDP packets.
      
      VM1 txgen:
      $ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
      --forward-mode=txonly --eth-peer=0,<mac addr of vm2>
      option: add "--txonly-multi-flow"
      option: use --txpkts=512 or 64 byte
      
      VM2 running XDP:
      $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
      $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
      options: XDP_DROP, XDP_PASS, XDP_TX
      
      To test REDIRECT to cpu 0, use
      $ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
      
      Single core performance comparison with skb-mode.
      64B:      skb-mode -> native-mode
      XDP_DROP: 1.6Mpps -> 2.4Mpps
      XDP_PASS: 338Kpps -> 367Kpps
      XDP_TX:   1.1Mpps -> 2.3Mpps
      REDIRECT-drop: 1.3Mpps -> 2.3Mpps
      
      512B:     skb-mode -> native-mode
      XDP_DROP: 863Kpps -> 1.3Mpps
      XDP_PASS: 275Kpps -> 376Kpps
      XDP_TX:   554Kpps -> 1.2Mpps
      REDIRECT-drop: 659Kpps -> 1.2Mpps
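
      To make the comparison easier to read, the native-mode gain over skb-mode
      can be computed as a ratio. The short Python sketch below only restates
      the figures from the two tables above; RESULTS, speedup() and summarize()
      are illustrative names, not part of any tool used in the test.

```python
# Single-core rates in packets per second, copied from the tables above
# as (skb-mode, native-mode) pairs.
RESULTS = {
    "64B": {
        "XDP_DROP": (1_600_000, 2_400_000),
        "XDP_PASS": (338_000, 367_000),
        "XDP_TX": (1_100_000, 2_300_000),
        "REDIRECT-drop": (1_300_000, 2_300_000),
    },
    "512B": {
        "XDP_DROP": (863_000, 1_300_000),
        "XDP_PASS": (275_000, 376_000),
        "XDP_TX": (554_000, 1_200_000),
        "REDIRECT-drop": (659_000, 1_200_000),
    },
}


def speedup(skb_pps: int, native_pps: int) -> float:
    """Ratio of the native-mode rate to the skb-mode rate."""
    return native_pps / skb_pps


def summarize(results: dict) -> dict:
    """Return {packet size: {action: speedup rounded to 2 decimals}}."""
    return {
        size: {action: round(speedup(*pair), 2) for action, pair in rows.items()}
        for size, rows in results.items()
    }


if __name__ == "__main__":
    for size, rows in summarize(RESULTS).items():
        for action, ratio in rows.items():
            print(f"{size:>4} {action:<14} {ratio:.2f}x")
```

      On these numbers XDP_TX gains the most (roughly 2x at both packet sizes),
      while XDP_PASS gains the least.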
      
      Demo: https://youtu.be/4lm1CSCi78Q
      
      Future work:
      - XDP frag support
      - use napi_consume_skb() instead of dev_kfree_skb_any at unmap
      - stats using u64_stats_t
      - using bitfield macro BIT()
      - optimization for DMA synchronization using actual frame length,
        instead of always max_len
      Signed-off-by: William Tu <u9012063@gmail.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54f00cce
    • David S. Miller's avatar
      Merge branch 'ovs-drop-reasons' · 76fa3635
      David S. Miller authored
      Adrian Moreno says:
      
      ====================
      openvswitch: add drop reasons
      
      There is currently a gap in drop visibility in the openvswitch module.
      This series tries to improve this by adding a new drop reason subsystem
      for OVS.
      
      Apart from adding a new drop reason subsystem and some common drop
      reasons, this series takes Eric's preliminary work [1] on adding an
      explicit drop action and integrates it into the same subsystem.
      
      A limitation of this series is that it does not report upcall errors.
      The reason is that there could be many sources of upcall drops and the
      most common one, which is the netlink buffer overflow, cannot be
      reported via kfree_skb() because the skb is freed in the netlink layer
      (see [2]). Therefore, using a reason for the rare events and not the
      common one would be even more misleading. I'd propose we add (in a
      follow up patch) a tracepoint to better report upcall errors.
      
      [1] https://lore.kernel.org/netdev/202306300609.tdRdZscy-lkp@intel.com/T/
      [2] commit 1100248a ("openvswitch: Fix double reporting of drops in dropwatch")
      
      ---
      v4 -> v5:
      - Rebased
      - Added a helper function to explicitly convert drop reason enum types
      
      v3 -> v4:
      - Changed names of errors following Ilya's suggestions
      - Moved the ovs-dpctl.py changes from patch 7/7 to 3/7
      - Added a test to ensure actions following a drop are rejected
      
      rfc2 -> v3:
      - Rebased on top of latest net-next
      
      rfc1 -> rfc2:
      - Fail when an explicit drop is not the last
      - Added a drop reason for action errors
      - Added braces around macros
      - Dropped patch that added support for masks in ovs-dpctl.py as it's now
        included in Aaron's series [2].
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76fa3635
    • Adrian Moreno's avatar
      selftests: openvswitch: add explicit drop testcase · 42420291
      Adrian Moreno authored
      Test that explicit drops generate the right drop reason. Also verify that
      the kernel rejects flows with actions following an explicit drop.
      Acked-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42420291
    • Adrian Moreno's avatar
      selftests: openvswitch: add drop reason testcase · aab1272f
      Adrian Moreno authored
      Test if the correct drop reason is reported when OVS drops a packet due
      to an explicit flow.
      Acked-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aab1272f
    • Adrian Moreno's avatar
      net: openvswitch: add misc error drop reasons · 43d95b30
      Adrian Moreno authored
      Use drop reasons from include/net/dropreason-core.h when a reasonable
      candidate exists.
      Acked-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      43d95b30
    • Adrian Moreno's avatar
      net: openvswitch: add meter drop reason · f329d1bc
      Adrian Moreno authored
      Using an independent drop reason makes it easy to distinguish between
      QoS-triggered and flow-triggered drops.
      Acked-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f329d1bc
    • Eric Garver's avatar
      net: openvswitch: add explicit drop action · e7bc7db9
      Eric Garver authored
      From: Eric Garver <eric@garver.life>
      
      This adds an explicit drop action. This is used by OVS to drop packets
      for which it cannot determine what to do. An explicit action in the
      kernel allows passing the reason _why_ the packet is being dropped or
      zero to indicate that no particular error happened (i.e., OVS intentionally
      dropped the packet).
      
      Since the error codes coming from userspace mean nothing for the kernel,
      we squash all of them into only two drop reasons:
      - OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed
      - OVS_DROP_EXPLICIT to indicate a zero value was passed (no error)
      
      e.g. trace all OVS dropped skbs
      
       # perf trace -e skb:kfree_skb --filter="reason >= 0x30000"
       [..]
       106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \
        location:0xffffffffc0d9b462, protocol: 2048, reason: 196611)
      
      reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT)
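
      The decimal-to-hex step in the example can be checked with a few lines of
      Python. This is only a sketch: the split into an upper 16-bit subsystem
      ID and a lower 16-bit subsystem-local code is an assumption that happens
      to match the example value and the ">= 0x30000" filter, and the constant
      and function names are illustrative.

```python
from typing import Tuple

# Assumption: the subsystem ID sits in the upper 16 bits of the reason and
# the subsystem-local code in the lower 16 bits.
SUBSYS_SHIFT = 16
SUBSYS_CODE_MASK = (1 << SUBSYS_SHIFT) - 1

# Threshold used by the perf filter in the example above.
FILTER_THRESHOLD = 0x30000


def split_reason(reason: int) -> Tuple[int, int]:
    """Return (subsystem id, subsystem-local code) for a raw reason value."""
    return reason >> SUBSYS_SHIFT, reason & SUBSYS_CODE_MASK


def matches_filter(reason: int) -> bool:
    """Mirror the 'reason >= 0x30000' filter expression from the example."""
    return reason >= FILTER_THRESHOLD


if __name__ == "__main__":
    raw = 196611
    subsys, code = split_reason(raw)
    print(f"{raw} = {raw:#x}: subsystem {subsys}, code {code}")
```

      Under that assumption, 196611 splits into subsystem 3 and local code 3,
      which the text identifies as OVS_DROP_EXPLICIT.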
      
      Also, this patch allows ovs-dpctl.py to add explicit drop actions as:
        "drop"     -> implicit empty-action drop
        "drop(0)"  -> explicit non-error action drop
        "drop(42)" -> explicit error action drop
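
      The three spellings above, combined with the two-reason squashing
      described earlier, can be modelled as follows. This is a hypothetical
      Python sketch and not the parser in ovs-dpctl.py; classify_drop() and
      the regular expression are made up for illustration.

```python
import re

# Matches "drop" or "drop(<non-negative integer>)".
_DROP_RE = re.compile(r"^drop(?:\((\d+)\))?$")


def classify_drop(action: str) -> str:
    """Map a drop action string to the kind of drop it describes."""
    match = _DROP_RE.match(action.strip())
    if match is None:
        raise ValueError(f"not a drop action: {action!r}")
    if match.group(1) is None:
        # Plain "drop": no explicit drop action is emitted at all.
        return "implicit empty-action drop"
    # Userspace error codes mean nothing to the kernel, so every non-zero
    # value is squashed into a single reason.
    if int(match.group(1)) == 0:
        return "OVS_DROP_EXPLICIT"
    return "OVS_DROP_EXPLICIT_WITH_ERROR"
```

      With this model, "drop(42)" and "drop(7)" are indistinguishable on the
      kernel side, which is the intended effect of the squashing.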
      Signed-off-by: Eric Garver <eric@garver.life>
      Co-developed-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e7bc7db9
    • Adrian Moreno's avatar
      net: openvswitch: add action error drop reason · ec7bfb5e
      Adrian Moreno authored
      Add a drop reason for packets that are dropped because an action
      returns a non-zero error code.
      Acked-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec7bfb5e
    • Adrian Moreno's avatar
      net: openvswitch: add last-action drop reason · 9d802da4
      Adrian Moreno authored
      Create a new drop reason subsystem for openvswitch and add the first
      drop reason to represent last-action drops.
      
      Last-action drops happen when a flow has an empty action list or there
      is no action that consumes the packet (output, userspace, recirc, etc).
      It is the most common way in which OVS drops packets.
      
      Implementation-wise, most of these skb-consuming actions already call
      "consume_skb" internally and return directly from within the
      do_execute_actions() loop so with minimal changes we can assume that
      any skb that exits the loop normally is a packet drop.
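
      The control-flow idea can be illustrated with a toy model in Python. It
      is a deliberate simplification, not the kernel's do_execute_actions():
      the set of consuming actions is taken from the examples in the text, the
      reason name OVS_DROP_LAST_ACTION is assumed, and the real code handles
      many cases (such as cloning for non-final outputs) that are ignored here.

```python
from typing import Iterable, Optional

# Actions that take ownership of the packet, per the examples in the text.
# Illustrative only; the real list is longer and more nuanced.
CONSUMING_ACTIONS = {"output", "userspace", "recirc"}

# Assumed name for the last-action drop reason.
LAST_ACTION_DROP = "OVS_DROP_LAST_ACTION"


def execute_actions(actions: Iterable[str]) -> Optional[str]:
    """Return a drop reason, or None if some action consumed the packet."""
    for action in actions:
        if action in CONSUMING_ACTIONS:
            # Consuming actions return from inside the loop.
            return None
        # Non-consuming actions (e.g. header rewrites) fall through to the
        # next iteration.
    # Leaving the loop normally means nothing consumed the packet: a drop.
    return LAST_ACTION_DROP
```

      An empty action list and a list holding only non-consuming actions both
      fall out of the loop and are reported as last-action drops.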
      Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d802da4
    • David S. Miller's avatar
      Merge branch 'mptcp-remove-msk-subflow' · afb0c192
      David S. Miller authored
      Matthieu Baerts says:
      
      ====================
      mptcp: get rid of msk->subflow
      
      The MPTCP protocol maintains an additional struct socket per connection,
      mainly to be able to easily use tcp-level struct socket operations.
      
      This leads to several side effects, beyond the quite unfortunate /
      confusing 'subflow' field name:
      
      - active and passive sockets behave inconsistently: only active ones
        have a non-NULL msk->subflow, leading to different error handling and
        different error codes returned to user-space in several places.
      
      - active sockets use an unneeded, larger amount of memory
      
      - passive sockets can't successfully go through an accept(), disconnect(),
        accept() sequence; see [1] for more details.
      
      The first 13 patches of this series are from Paolo and address all of the
      above, finally getting rid of the blamed field:
      
      - The first patch is a minor clean-up.
      
      - In the next 11 patches, msk->subflow usage is systematically removed
        from the MPTCP protocol, replacing it with direct msk->first usage,
        eventually introducing new core helpers when needed.
      
      - The 13th patch finally disposes of the field, and it's the only patch in
        the series intended to produce functional changes.
      
      The 14th and last patch is from Kuniyuki and is not linked to the
      previous ones: it is a small clean-up to get rid of an unnecessary check
      in mptcp_init_sock().
      
      [1] https://github.com/multipath-tcp/mptcp_net-next/issues/290
      ====================
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      afb0c192
    • Kuniyuki Iwashima's avatar
      mptcp: Remove unnecessary test for __mptcp_init_sock() · e2636917
      Kuniyuki Iwashima authored
      __mptcp_init_sock() always returns 0 because mptcp_init_sock() used
      to return the value directly.
      
      But after commit 18b683bf ("mptcp: queue data for mptcp level
      retransmission"), __mptcp_init_sock() no longer needs to return a value.
      
      Let's remove the unnecessary test for __mptcp_init_sock() and make
      it return void.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2636917
    • Paolo Abeni's avatar
      mptcp: get rid of msk->subflow · 39880bd8
      Paolo Abeni authored
      This field is now used only as a flag to control the first subflow's
      deletion at close() time. Introduce a new bit flag for that and finally
      drop the field.
      
      As an intended side effect, the first subflow sock is now not freed
      before close() even for passive sockets. The msk has no open/active
      subflows if the first one is closed and the subflow list is singular;
      update the state check in mptcp_stream_accept() accordingly.
      
      Among other benefits, the subflow removal reduces the amount of memory
      used on the client side for each mptcp connection, allows passive sockets
      to go through a successful accept()/disconnect()/connect() sequence, and
      makes the returned error code consistent for failing passive and active
      sockets.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/290
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      39880bd8